data.table: Secondary Indices and Auto Indexing
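1 Introduction
Earlier posts covered fast subsetting with setkey. There is a second way to subset quickly, setindex. This post looks at how the two differ and how to use the latter.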
2 Reading the data
library(data.table)
flights <- fread("flights14.csv")
head(flights)
# year month day dep_delay arr_delay carrier origin dest air_time distance hour
# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9
# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11
# 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19
# 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7
# 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13
# 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18
dim(flights)
# [1] 253316 11
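3 Secondary indices
What they are
Secondary indices are similar to keys, with two main differences:
- They do not physically reorder the data. Only the order is computed, and it is stored as an order vector in an extra attribute of the data.table called index.
- A data.table can have more than one secondary index.
Usage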
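To make the two differences concrete, here is a minimal sketch on a toy table. The table and its values are made up for illustration and are not taken from flights14.csv.

```r
library(data.table)

# Toy data, hypothetical values
dt <- data.table(origin = c("LGA", "JFK", "EWR", "JFK"),
                 dest   = c("TPA", "LAX", "LAX", "SFO"))

setindex(dt, origin)        # first secondary index
setindex(dt, origin, dest)  # a second one; the first is kept

row_order   <- dt$origin    # rows are not reordered by setindex
index_names <- indices(dt)  # names of all indices currently set

print(row_order)
print(index_names)
print(key(dt))              # no key has been set on this table
```

The rows stay in their original order, and both indices are listed side by side.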
setindex(flights, origin)
head(flights)
# year month day dep_delay arr_delay carrier origin dest air_time distance hour
# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9
# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11
# 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19
# 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7
# 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13
# 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18
## alternatively we can provide character vectors to the function 'setindexv()'
# setindexv(flights, "origin") # useful to program with
# 'index' attribute added
names(attributes(flights))
# [1] "names" "row.names" "class" ".internal.selfref"
# [5] "index"
Get the columns that currently have an index set:
indices(flights)
# [1] "origin"
setindex(flights, origin, dest)
indices(flights)
# [1] "origin" "origin__dest"
Why secondary indices are needed
The setkey approach:
## not run
setkey(flights, origin)
flights["JFK"] # or flights[.("JFK")]
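This takes two steps:
- compute the order vector
- physically reorder the data
Both are fast, but the second is the more time-consuming step. setindex skips it. It is used together with the on argument, which the next section walks through.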
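The contrast between the two approaches can be sketched on a small table. This is only an illustration with made-up data; the id column is a hypothetical helper that makes the row order visible.

```r
library(data.table)

# Toy data, hypothetical values; id records the original row position
dt <- data.table(origin = c("LGA", "JFK", "EWR", "JFK"), id = 1:4)

keyed <- copy(dt)
setkey(keyed, origin)             # physically sorts the rows by origin
print(keyed$id)                   # 3 2 4 1

res <- dt["JFK", on = "origin"]   # no key needed, dt is left untouched
print(res$id)                     # 2 4
print(dt$id)                      # 1 2 3 4
```

Working on a copy keeps the original table unchanged, so both behaviours can be compared directly.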

4 Subsetting with the on argument, based on secondary indices
flights["JFK", on = "origin"]
# year month day dep_delay arr_delay carrier origin dest air_time distance hour
# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9
# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11
# 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19
# 4: 2014 1 1 2 1 AA JFK LAX 350 2475 13
# 5: 2014 1 1 -2 -18 AA JFK LAX 338 2475 21
# ---
# 81479: 2014 10 31 -4 -21 UA JFK SFO 337 2586 17
# 81480: 2014 10 31 -2 -37 UA JFK SFO 344 2586 18
# 81481: 2014 10 31 0 -33 UA JFK LAX 320 2475 17
# 81482: 2014 10 31 -6 -38 UA JFK SFO 343 2586 9
# 81483: 2014 10 31 -6 -38 UA JFK LAX 323 2475 11
## alternatively
flights[.("JFK"), on = "origin"] # or
flights[list("JFK"), on = "origin"]
If a secondary index has already been created with setindex, on uses it directly instead of recomputing it. Setting verbose = TRUE prints what happens:
setindex(flights, origin)
flights["JFK", on = "origin", verbose = TRUE][1:5]
# i.V1 has same type (character) as x.origin. No coercion needed.
# on= matches existing index, using index
# Starting bmerge ...
# forder.c received 1 rows and 1 columns
# bmerge done in 0.000s elapsed (0.001s cpu)
# Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)
# year month day dep_delay arr_delay carrier origin dest air_time distance hour
# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9
# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11
# 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19
# 4: 2014 1 1 2 1 AA JFK LAX 350 2475 13
# 5: 2014 1 1 -2 -18 AA JFK LAX 338 2475 21
Subsetting on multiple conditions:
flights[.("JFK", "LAX"), on = c("origin", "dest")][1:5]
# year month day dep_delay arr_delay carrier origin dest air_time distance hour
# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9
# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11
# 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19
# 4: 2014 1 1 2 1 AA JFK LAX 350 2475 13
# 5: 2014 1 1 -2 -18 AA JFK LAX 338 2475 21
Returning selected columns with j:
flights[.("LGA", "TPA"), .(arr_delay), on = c("origin", "dest")]
# arr_delay
# 1: 1
# 2: 14
# 3: -17
# 4: -4
# 5: -12
# ---
# 1848: 39
# 1849: -24
# 1850: -12
# 1851: 21
# 1852: -11
Chaining:
flights[.("LGA", "TPA"), .(arr_delay), on = c("origin", "dest")][order(-arr_delay)]
# arr_delay for LGA -> TPA flights, sorted in decreasing order
Computing in j:
flights[.("LGA", "TPA"), max(arr_delay), on = c("origin", "dest")]
# [1] 486
Sub-assigning by reference:
# get all 'hours' in flights
flights[, sort(unique(hour))]
# [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
flights[.(24L), hour := 0L, on = "hour"]
Check:
flights[, sort(unique(hour))]
# [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
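The same replacement can be tried on a tiny table first. This is a sketch with made-up values; only the pattern `DT[.(value), col := new, on = "col"]` comes from the example above.

```r
library(data.table)

# Toy data, hypothetical values
dt <- data.table(hour = c(0L, 12L, 24L, 24L), id = 1:4)

# Rows where hour == 24 are matched through on= and updated in place
dt[.(24L), hour := 0L, on = "hour"]

print(dt$hour)   # 0 12 0 0
```

Because := works by reference, there is no need to assign the result back to dt.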
Grouped aggregation:
ans <- flights["JFK", max(dep_delay), keyby = month, on = "origin"]
head(ans)
# month V1
# 1: 1 881
# 2: 2 1014
# 3: 3 920
# 4: 4 1241
# 5: 5 853
# 6: 6 798
With the mult argument:
flights[c("BOS", "DAY"), on = "dest", mult = "first"]
# year month day dep_delay arr_delay carrier origin dest air_time distance hour
# 1: 2014 1 1 3 1 AA JFK BOS 39 187 12
# 2: 2014 1 1 25 35 EV EWR DAY 102 533 17
flights[.(c("LGA", "JFK", "EWR"), "XNA"), on = c("origin", "dest"), mult = "last"]
# year month day dep_delay arr_delay carrier origin dest air_time distance hour
# 1: 2014 10 31 -5 -11 MQ LGA XNA 165 1147 6
# 2: NA NA NA NA NA <NA> JFK XNA NA NA NA
# 3: 2014 10 31 -2 -25 EV EWR XNA 160 1131 6
With the nomatch argument:
flights[.(c("LGA", "JFK", "EWR"), "XNA"), mult = "last", on = c("origin", "dest"), nomatch = NULL]
# year month day dep_delay arr_delay carrier origin dest air_time distance hour
# 1: 2014 10 31 -5 -11 MQ LGA XNA 165 1147 6
# 2: 2014 10 31 -2 -25 EV EWR XNA 160 1131 6
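How mult and nomatch interact is easier to follow on a table small enough to read at a glance. The sketch below uses made-up rows, and the id column is a hypothetical helper.

```r
library(data.table)

# Toy data, hypothetical values; there is no JFK -> XNA row
dt <- data.table(origin = c("LGA", "LGA", "EWR", "JFK"),
                 dest   = c("XNA", "XNA", "XNA", "LAX"),
                 id     = 1:4)

# mult = "last" keeps the last matching row per query row;
# a query row with no match is returned with NA in the other columns
with_na <- dt[.(c("LGA", "JFK", "EWR"), "XNA"),
              on = c("origin", "dest"), mult = "last"]
print(with_na$id)   # 2 NA 3

# nomatch = NULL drops the query rows that have no match
no_na <- dt[.(c("LGA", "JFK", "EWR"), "XNA"),
            on = c("origin", "dest"), mult = "last", nomatch = NULL]
print(no_na$id)     # 2 3
```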
5 Auto indexing
Subsetting with the == and %in% operators creates an index automatically.
Building the data:
set.seed(1L)
dt = data.table(x = sample(1e5L, 1e7L, TRUE), y = runif(100L))
print(object.size(dt), units = "Mb")
# 114.4 Mb
## have a look at all the attribute names
names(attributes(dt))
# [1] "names" "row.names" "class" ".internal.selfref"
## run the first time
(t1 <- system.time(ans <- dt[x == 989L]))
# user system elapsed
# 0.538 0.015 0.097
head(ans)
# x y
# 1: 989 0.7757157
# 2: 989 0.6813302
# 3: 989 0.2815894
# 4: 989 0.4954259
# 5: 989 0.7885886
# 6: 989 0.5547504
## secondary index is created
names(attributes(dt))
# [1] "names" "row.names" "class" ".internal.selfref"
# [5] "index"
indices(dt)
# [1] "x"
The index attribute has now been added. Once the secondary index exists, later subsets on the same column are much faster:
## successive subsets
(t2 <- system.time(dt[x == 989L]))
# user system elapsed
# 0.001 0.001 0.001
system.time(dt[x %in% 1989:2012])
# user system elapsed
# 0.001 0.000 0.001
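A smaller version of this experiment is sketched below, together with the option that switches auto indexing off. The table size and values are made up. The sketch assumes the datatable.auto.index option behaves as described in the data.table documentation, and it prints indices() rather than stating what it returns.

```r
library(data.table)

set.seed(1L)
# Toy data, much smaller than the 1e7-row table above
dt <- data.table(x = sample(100L, 1e4L, TRUE), y = runif(1e4L))

first <- dt[x == 42L]    # may create an index on x as a side effect
print(indices(dt))
second <- dt[x == 42L]   # can reuse the index if one was created

# Both subsets return the same rows either way
print(identical(first$y, second$y))

# Opting out: with auto indexing disabled, no index is expected
old <- options(datatable.auto.index = FALSE)
dt2 <- data.table(x = sample(100L, 1e4L, TRUE))
invisible(dt2[x == 42L])
print(indices(dt2))
options(old)             # restore the previous setting
```

Switching the option off can be handy when the extra memory for the index is not worth it, for example for one-off subsets.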
