data.table 之 添加或删除列
测试开头



测试结尾
要仰望,会平视,多俯看

1引言
这节主要介绍怎么在 data.table
里使用特殊操作符 :=
添加或删除列操作,及结合 i
或 by
进行组合操作。
2读取数据
还是读取之前下载的测试数据:
flights <- fread("flights14.csv")
flights
# year month day dep_delay arr_delay carrier origin dest air_time distance hour
# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9
# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11
# 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19
# 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7
# 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13
# ---
# 253312: 2014 10 31 1 -30 UA LGA IAH 201 1416 14
# 253313: 2014 10 31 -5 -14 UA EWR IAH 189 1400 8
# 253314: 2014 10 31 -8 16 MQ LGA RDU 83 431 11
# 253315: 2014 10 31 -4 15 MQ LGA DTW 75 502 11
# 253316: 2014 10 31 -5 1 MQ LGA SDF 110 659 8
dim(flights)
# [1] 253316 11
3介绍
构建测试 data.frame :
DF = data.frame(ID = c("b","b","b","a","a","c"), a = 1:6, b = 7:12, c = 13:18)
DF
# ID a b c
# 1 b 1 7 13
# 2 b 2 8 14
# 3 b 3 9 15
# 4 a 4 10 16
# 5 a 5 11 17
# 6 c 6 12 18
操作列:
# (1) -- replace entire column
DF$c <- 18:13
# or
# (2) -- subassign in column 'c'
DF$c[DF$ID == "b"] <- 15:13
:= 操作符使用形式
第一种形式:
DT[, c("colA", "colB", ...) := list(valA, valB, ...)]
# when you have only one column to assign to you
# can drop the quotes and list(), for convenience
DT[, colA := valA]
第二种形式:
DT[, `:=`(colA = valA, # valA is assigned to colA
colB = valB, # valB is assigned to colB
...
)]
4添加/更新/删除列
添加列
flights[, `:=`(speed = distance / (air_time/60), # speed in mph (mi/h)
delay = arr_delay + dep_delay)] # delay in minutes
head(flights)
# year month day dep_delay arr_delay carrier origin dest air_time distance hour speed delay
# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 413.6490 27
# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11 409.0909 10
# 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19 423.0769 11
# 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7 395.5414 -34
# 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13 424.2857 3
# 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18 434.3363 4
## alternatively, using the 'LHS := RHS' form
# flights[, c("speed", "delay") := list(distance/(air_time/60), arr_delay + dep_delay)]
对特定列行赋值
查看列内容:
# get all 'hours' in flights
flights[, sort(unique(hour))]
# [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
把 24 替换为 0:
# subassign by reference
flights[hour == 24L, hour := 0L]
加上 [] 查看内容:
flights[hour == 24L, hour := 0L][]
# year month day dep_delay arr_delay carrier origin dest air_time distance hour speed
# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 413.6490
# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11 409.0909
# 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19 423.0769
# 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7 395.5414
# 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13 424.2857
# ---
# 253312: 2014 10 31 1 -30 UA LGA IAH 201 1416 14 422.6866
# 253313: 2014 10 31 -5 -14 UA EWR IAH 189 1400 8 444.4444
# 253314: 2014 10 31 -8 16 MQ LGA RDU 83 431 11 311.5663
# 253315: 2014 10 31 -4 15 MQ LGA DTW 75 502 11 401.6000
# 253316: 2014 10 31 -5 1 MQ LGA SDF 110 659 8 359.4545
# delay
# 1: 27
# 2: 10
# 3: 11
# 4: -34
# 5: 3
# ---
# 253312: -29
# 253313: -19
# 253314: 8
# 253315: 11
# 253316: -4
验证一下:
# check again for '24'
flights[, sort(unique(hour))]
# [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
删除列
flights[, c("delay") := NULL]
head(flights)
# year month day dep_delay arr_delay carrier origin dest air_time distance hour speed
# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 413.6490
# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11 409.0909
# 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19 423.0769
# 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7 395.5414
# 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13 424.2857
# 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18 434.3363
## or using the functional form
# flights[, `:=`(delay = NULL)]
只有一列可以这样写:
flights[, delay := NULL]
5结合 by 分组
flights[, max_speed := max(speed), by = .(origin, dest)]
head(flights)
# year month day dep_delay arr_delay carrier origin dest air_time distance hour speed max_speed
# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 413.6490 526.5957
# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11 409.0909 526.5957
# 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19 423.0769 526.5957
# 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7 395.5414 517.5000
# 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13 424.2857 526.5957
# 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18 434.3363 518.4507
6添加多列
按分组对子集取最大值并添加两列:
in_cols = c("dep_delay", "arr_delay")
out_cols = c("max_dep_delay", "max_arr_delay")
flights[, c(out_cols) := lapply(.SD, max), by = month, .SDcols = in_cols]
head(flights)
# year month day dep_delay arr_delay carrier origin dest air_time distance hour speed max_speed
# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 413.6490 526.5957
# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11 409.0909 526.5957
# 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19 423.0769 526.5957
# 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7 395.5414 517.5000
# 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13 424.2857 526.5957
# 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18 434.3363 518.4507
# max_dep_delay max_arr_delay
# 1: 973 996
# 2: 973 996
# 3: 973 996
# 4: 973 996
# 5: 973 996
# 6: 973 996
也可以将多列删除:
# RHS gets automatically recycled to length of LHS
flights[, c("speed", "max_speed", "max_dep_delay", "max_arr_delay") := NULL]
head(flights)
# year month day dep_delay arr_delay carrier origin dest air_time distance hour
# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9
# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11
# 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19
# 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7
# 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13
# 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18

欢迎加入生信交流群。加我微信我也拉你进 微信群聊 老俊俊生信交流群
哦,数据代码已上传至QQ群,欢迎加入下载。
群二维码:

老俊俊微信:
知识星球:
所以今天你学习了吗?
欢迎小伙伴留言评论!
今天的分享就到这里了,敬请期待下一篇!
最后欢迎大家分享转发,您的点赞是对我的鼓励和肯定!
如果觉得对您帮助很大,赏杯快乐水喝喝吧!
往期回顾
◀基于 featureCounts 原理提取基因非冗余外显子长度
◀python 学习之 featureCounts 软件的基因长度是怎么算的?
◀...
请关注“恒诺新知”微信公众号,感谢“R语言“,”数据那些事儿“,”老俊俊的生信笔记“,”冷🈚️思“,“珞珈R”,“生信星球”的支持!