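R: An introduction to data.table, with examples
Compared with the dplyr package, the data.table package can often process data considerably faster. This post gives a brief introduction to using the data.table package.
data.table: a package for fast processing of large data sets
1. Reading data
The data-reading function in the data.table package: fread()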
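As a minimal sketch of fread() in use, the example below writes a small table to a temporary CSV file and reads it back. The column names id and name and the temporary file are illustrative assumptions, not part of the original post.

```r
library(data.table)

# Write a small example table to a temporary CSV file (hypothetical data).
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(id = 1:3, name = c("a", "b", "c")), tmp)

# fread() detects the separator and the column types automatically
# and returns a data.table.
dat <- fread(tmp)
print(class(dat))

# Clean up the temporary file.
unlink(tmp)
```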
2. Creating a data.table
> library(data.table)
> DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
> DT
x y v
1: a 1 1
2: a 3 2
3: a 6 3
4: b 1 4
5: b 3 5
6: b 6 6
7: c 1 7
8: c 3 8
9: c 6 9
>
3. Basic operations
(1) Row extraction
Row extraction covers single-row and multi-row extraction.
Single-row extraction
> DT[2] # 2nd row
x y v
1: a 3 2
> DT[2,] # same
x y v
1: a 3 2
>
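Here DT[2] and DT[2,] are exactly equivalent. The trailing "," only indicates that further arguments could be supplied; when they are omitted, their default values are used. From here on, a trailing "," of this kind is left out.
Multi-row extraction
Extraction by number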
> DT[1:2]
x y v
1: a 1 1
2: a 3 2
> DT[c(2,5)]
x y v
1: a 3 2
2: b 3 5
>
(2) Logical extraction
> DT[c(FALSE,TRUE)] # even rows (usual recycling)
x y v
1: a 3 2
2: b 1 4
3: b 6 6
4: c 3 8
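Here c(FALSE,TRUE) is recycled into a vector whose length equals the number of rows in DT. Note that recent versions of data.table no longer recycle a logical i of a different length and raise an error instead, so this call depends on the installed version.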
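A minimal sketch of the same even-row selection with the logical vector built to full length explicitly, so that it does not rely on recycling. The names keep and even_rows are illustrative.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)

# Expand the FALSE/TRUE pattern to exactly one value per row.
keep <- rep(c(FALSE, TRUE), length.out = nrow(DT))

# Rows where keep is TRUE are returned, i.e. rows 2, 4, 6 and 8.
even_rows <- DT[keep]
print(even_rows)
```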
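(3) Column extraction
As with rows, column extraction covers single-column and multi-column extraction.
Single-column extraction
Extraction by number
When extracting by column number, set the with argument to FALSE. This was required in older versions of data.table; in recent versions the default with=TRUE returns the same column, as the second call below shows.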
> DT[,2,with=FALSE] # 2nd column
y
1: 1
2: 3
3: 6
4: 1
5: 3
6: 6
7: 1
8: 3
9: 6
> DT[,2,with=TRUE] # 2nd column
y
1: 1
2: 3
3: 6
4: 1
5: 3
6: 6
7: 1
8: 3
9: 6
>
Extraction by column name
> DT[,list(v)] # v column (as data.table)
v
1: 1
2: 2
3: 3
4: 4
5: 5
6: 6
7: 7
8: 8
9: 9
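(4) Renaming columns
Column names can be changed with the setnames() function. It changes the names by reference, without copying the data, and it also seems to be somewhat faster than the names() and colnames() replacement functions used on data.frame objects.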
> dt = data.table(a=1:2,b=3:4,c=5:6) # compare to data.table
> dt
a b c
1: 1 3 5
2: 2 4 6
> try(tracemem(dt)) # by reference, no deep or shallow copies
[1] "<000000001C613E10>"
> dt
a b c
1: 1 3 5
2: 2 4 6
> setnames(dt,"b","B") # by name, no match() needed (warning if "b" is missing)
> dt
a B c
1: 1 3 5
2: 2 4 6
> setnames(dt,3,"C") # by position with warning if 3 > ncol(dt)
> dt
a B C
1: 1 3 5
2: 2 4 6
> setnames(dt,2:3,c("D","E")) # multiple
> dt
a D E
1: 1 3 5
2: 2 4 6
> setnames(dt,c("a","E"),c("A","F")) # multiple by name (warning if either "a" or "E" is missing)
> dt
A D F
1: 1 3 5
2: 2 4 6
> setnames(dt,c("X","Y","Z")) # replace all (length of names must be == ncol(DT))
> dt
X Y Z
1: 1 3 5
2: 2 4 6
>
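A minimal sketch of checking that setnames() works by reference: the address of the object, as reported by data.table's address() helper, is the same before and after the rename. The variable addr_before is illustrative.

```r
library(data.table)

dt <- data.table(a = 1:2, b = 3:4)

# Record where the object lives before renaming.
addr_before <- address(dt)

# Rename column "b" to "B" in place; no assignment with <- is needed.
setnames(dt, old = "b", new = "B")

# The names changed, but the object was not copied.
print(names(dt))
print(address(dt) == addr_before)
```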
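Multi-column extraction
Extraction by number
As with single-column extraction by number above, multi-column extraction by number also sets the with argument to FALSE.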
> DT[,2:3,with=FALSE]
y v
1: 1 1
2: 3 2
3: 6 3
4: 1 4
5: 3 5
6: 6 6
7: 1 7
8: 3 8
9: 6 9
> DT[,c(1,3),with=FALSE]
x v
1: a 1
2: a 2
3: a 3
4: b 4
5: b 5
6: b 6
7: c 7
8: c 8
9: c 9
>
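When the columns to extract are held in a variable, the with argument matters in every version. The sketch below shows three equivalent spellings; the variable cols is an illustrative assumption, and the .. prefix requires a reasonably recent data.table.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)

# Column names stored in an ordinary character vector.
cols <- c("y", "v")

# with = FALSE tells data.table to treat cols as column names, not as a column.
by_with <- DT[, cols, with = FALSE]

# The .. prefix looks cols up in the calling scope instead of among the columns.
by_dots <- DT[, ..cols]

# .SDcols restricts .SD (the subset of data) to the named columns.
by_sdcols <- DT[, .SD, .SDcols = cols]
```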
Extraction by column name
> DT[,list(y, v)]
y v
1: 1 1
2: 3 2
3: 6 3
4: 1 4
5: 3 5
6: 6 6
7: 1 7
8: 3 8
9: 6 9
>
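If the column names are not wrapped in list() when extracting by name, the columns are still extracted, but the result is returned as a vector.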
> DT[,v] # v column (as vector)
[1] 1 2 3 4 5 6 7 8 9
> DT[,c(v)] # same
[1] 1 2 3 4 5 6 7 8 9
> DT[, c(y, v)]
[1] 1 3 6 1 3 6 1 3 6 1 2 3 4 5 6 7 8 9
>
(5) Adding and deleting columns
Adding columns
Adding a single column
> DT
x y v
1: a 1 1
2: a 3 2
3: a 6 3
4: b 1 4
5: b 3 5
6: b 6 6
7: c 1 7
8: c 3 8
9: c 6 9
> DT[, a := 'k']
> DT
x y v a
1: a 1 1 k
2: a 3 2 k
3: a 6 3 k
4: b 1 4 k
5: b 3 5 k
6: b 6 6 k
7: c 1 7 k
8: c 3 8 k
9: c 6 9 k
> DT[,c:=8] # add a numeric column, 8 for all rows
> DT
x y v a c
1: a 1 1 k 8
2: a 3 2 k 8
3: a 6 3 k 8
4: b 1 4 k 8
5: b 3 5 k 8
6: b 6 6 k 8
7: c 1 7 k 8
8: c 3 8 k 8
9: c 6 9 k 8
> DT[,d:=9L] # add an integer column, 9L for all rows
> DT
x y v a c d
1: a 1 1 k 8 9
2: a 3 2 k 8 9
3: a 6 3 k 8 9
4: b 1 4 k 8 9
5: b 3 5 k 8 9
6: b 6 6 k 8 9
7: c 1 7 k 8 9
8: c 3 8 k 8 9
9: c 6 9 k 8 9
> DT[2,d:=10L] # subassign by reference to column d
> DT
x y v a c d
1: a 1 1 k 8 9
2: a 3 2 k 8 10
3: a 6 3 k 8 9
4: b 1 4 k 8 9
5: b 3 5 k 8 9
6: b 6 6 k 8 9
7: c 1 7 k 8 9
8: c 3 8 k 8 9
9: c 6 9 k 8 9
> DT[, e := d + 2]
> DT
x y v a c d e
1: a 1 1 k 8 9 11
2: a 3 2 k 8 10 12
3: a 6 3 k 8 9 11
4: b 1 4 k 8 9 11
5: b 3 5 k 8 9 11
6: b 6 6 k 8 9 11
7: c 1 7 k 8 9 11
8: c 3 8 k 8 9 11
9: c 6 9 k 8 9 11
>
If the name of the added column already exists in the data, the existing column is modified instead.
Adding multiple columns
> DT[, c('f', 'g') := list( d + 1, c)]
> DT[, ':='( f = d + 1, g = c)] # same
> DT
x y v a c d e f g
1: a 1 1 k 8 9 11 10 8
2: a 3 2 k 8 10 12 11 8
3: a 6 3 k 8 9 11 10 8
4: b 1 4 k 8 9 11 10 8
5: b 3 5 k 8 9 11 10 8
6: b 6 6 k 8 9 11 10 8
7: c 1 7 k 8 9 11 10 8
8: c 3 8 k 8 9 11 10 8
9: c 6 9 k 8 9 11 10 8
>
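Note that within a single := call, a new column can only be computed from columns that already exist, not from other columns created in the same call. In this example g = c runs fine, whereas g = f would raise an error.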
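One common way around this restriction is to chain two := calls, so that the first column exists by the time the second is computed. This is a sketch of that workaround on a small stand-in table, not something shown in the original session.

```r
library(data.table)

DT <- data.table(c = 8, d = c(9L, 10L, 9L))

# First call creates f; the chained second call can then refer to it.
DT[, f := d + 1][, g := f]

print(DT)
```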
(6) Deleting columns
> DT[,c:=NULL] # remove column c
> DT
x y v a d e f g
1: a 1 1 k 9 11 10 8
2: a 3 2 k 10 12 11 8
3: a 6 3 k 9 11 10 8
4: b 1 4 k 9 11 10 8
5: b 3 5 k 9 11 10 8
6: b 6 6 k 9 11 10 8
7: c 1 7 k 9 11 10 8
8: c 3 8 k 9 11 10 8
9: c 6 9 k 9 11 10 8
>
> DT[, c('d', 'e', 'f', 'g'):=NULL]
> DT
x y v a
1: a 1 1 k
2: a 3 2 k
3: a 6 3 k
4: b 1 4 k
5: b 3 5 k
6: b 6 6 k
7: c 1 7 k
8: c 3 8 k
9: c 6 9 k
>
Simple operations on columns
Simple operations mainly include the sum, mean, variance, and standard deviation.
> DT[2:3,sum(v)] # sum(v) over rows 2 and 3
[1] 5
> DT[2:3,mean(v)] # mean(v) over rows 2 and 3
[1] 2.5
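The variance and standard deviation mentioned above work the same way. A minimal sketch that computes several summaries in one call; the result column names are illustrative.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)

# Wrapping the expressions in .() (an alias for list()) returns a
# one-row data.table with one named column per summary.
stats <- DT[, .(total = sum(v),
                average = mean(v),
                variance = var(v),
                std_dev = sd(v))]
print(stats)
```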
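(7) Keys
Viewing and creating keys
Keys are set on columns. Once a key is set, the data are automatically re-sorted by the key values, so each data.table can have at most one key. The key can, however, consist of several columns, and these columns can be numeric, factor, character, or some other types.
Creating a single-column key
> ## method first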
> key(DT) # key
[1] "y"
> setkey(DT,x) # set a 1-column key. No quotes, for convenience.
> key(DT)
[1] "x"
> DT
x y v a
1: a 1 1 k
2: a 3 2 k
3: a 6 3 k
4: b 1 4 k
5: b 3 5 k
6: b 6 6 k
7: c 1 7 k
8: c 3 8 k
9: c 6 9 k
> ## method second
> setkeyv(DT,"y") # same (v in setkeyv stands for vector)
> key(DT)
[1] "y"
>
Setting a new key replaces the existing one.
Creating a multi-column key
> ## method first # key
> setkey(DT,x,v) # set a 2-column key. No quotes, for convenience.
> key(DT)
[1] "x" "v"
> DT
x y v a
1: a 1 1 k
2: a 3 2 k
3: a 6 3 k
4: b 1 4 k
5: b 3 5 k
6: b 6 6 k
7: c 1 7 k
8: c 3 8 k
9: c 6 9 k
> ## method second
> setkeyv(DT,c("x", "y")) # key on x and y, given as a character vector (v in setkeyv stands for vector)
> key(DT)
[1] "x" "y"
> DT
x y v a
1: a 1 1 k
2: a 3 2 k
3: a 6 3 k
4: b 1 4 k
5: b 3 5 k
6: b 6 6 k
7: c 1 7 k
8: c 3 8 k
9: c 6 9 k
>
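Because DT above is already sorted, the re-sorting that a key causes is not visible in the output. The sketch below uses a small unsorted table (an illustrative stand-in) to show the re-sort, and also shows checking for a key with haskey() and removing it again.

```r
library(data.table)

# Deliberately unsorted example data.
DT2 <- data.table(x = c("b", "a", "c"), v = 1:3)

# Setting the key physically reorders the rows by x.
setkey(DT2, x)
print(DT2)
print(haskey(DT2))
key_after_set <- key(DT2)

# Passing NULL removes the key; the rows stay in their current order.
setkey(DT2, NULL)
print(haskey(DT2))
```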
Extracting data via the key
Extracting rows through the key can speed up data extraction.
Single-column key
Selecting matching rows
> setkey(DT, x)
> DT["a"] # binary search (fast)
x y v a
1: a 1 1 k
2: a 3 2 k
3: a 6 3 k
> DT[.("a")] # same; i.e. binary search (fast)
x y v a
1: a 1 1 k
2: a 3 2 k
3: a 6 3 k
> DT[x=="a"] # same; i.e. binary search (fast)
x y v a
1: a 1 1 k
2: a 3 2 k
3: a 6 3 k
>
Excluding matching rows
> DT[!.("a")] # not join
x y v a
1: b 1 4 k
2: b 3 5 k
3: b 6 6 k
4: c 1 7 k
5: c 3 8 k
6: c 6 9 k
> DT[!"a"] # same
x y v a
1: b 1 4 k
2: b 3 5 k
3: b 6 6 k
4: c 1 7 k
5: c 3 8 k
6: c 6 9 k
> DT[!2:4] # all rows other than 2:4
x y v a
1: a 1 1 k
2: b 3 5 k
3: b 6 6 k
4: c 1 7 k
5: c 3 8 k
6: c 6 9 k
>
Multi-column key
Selecting matching rows
> setkey(DT, x, y)
> # Method First
> DT["a"] # join to 1st column of key
x y v a
1: a 1 1 k
2: a 3 2 k
3: a 6 3 k
> DT[.("a")] # same, .() is an alias for list()
x y v a
1: a 1 1 k
2: a 3 2 k
3: a 6 3 k
> DT[.("a",3)] # join to 2 columns
x y v a
1: a 3 2 k
> DT[.("a",3:6)] # join 4 rows (2 missing)
x y v a
1: a 3 2 k
2: a 4 NA NA
3: a 5 NA NA
4: a 6 3 k
> DT[.("a",3:6),nomatch=0] # remove missing
x y v a
1: a 3 2 k
2: a 6 3 k
> DT[.("a",3:6),roll=TRUE] # rolling join (locf)
x y v a
1: a 3 2 k
2: a 4 2 k
3: a 5 2 k
4: a 6 3 k
> ## Method Second
> DT[J('a')]
x y v a
1: a 1 1 k
2: a 3 2 k
3: a 6 3 k
> DT[J("a",3)] # binary search (fast)
x y v a
1: a 3 2 k
> DT[J("a",3:6)] # same; i.e. binary search (fast)
x y v a
1: a 3 2 k
2: a 4 NA NA
3: a 5 NA NA
4: a 6 3 k
> DT[J("a",3:6), nomatch = 0]
x y v a
1: a 3 2 k
2: a 6 3 k
> DT[J("a",3:6), roll = T]
x y v a
1: a 3 2 k
2: a 4 2 k
3: a 5 2 k
4: a 6 3 k
> ## Method Third
> DT[list("a")]
x y v a
1: a 1 1 k
2: a 3 2 k
3: a 6 3 k
> DT[list("a",3)]
x y v a
1: a 3 2 k
> DT[list("a", 3:6)]
x y v a
1: a 3 2 k
2: a 4 NA NA
3: a 5 NA NA
4: a 6 3 k
> DT[list("a", 3:6), nomatch = 0]
x y v a
1: a 3 2 k
2: a 6 3 k
> DT[list("a", 3:6), roll = T]
x y v a
1: a 3 2 k
2: a 4 2 k
3: a 5 2 k
4: a 6 3 k
>
Excluding matching rows
> DT[x!="b" | y!=3] # not yet optimized, currently vector scans
x y v a
1: a 1 1 k
2: a 3 2 k
3: a 6 3 k
4: b 1 4 k
5: b 6 6 k
6: c 1 7 k
7: c 3 8 k
8: c 6 9 k
> DT[!.("b",3)] # same result but much faster
x y v a
1: a 1 1 k
2: a 3 2 k
3: a 6 3 k
4: b 1 4 k
5: b 6 6 k
6: c 1 7 k
7: c 3 8 k
8: c 6 9 k
>
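The same joins can also be written without setting a key first, by naming the join columns in the on argument. This is an alternative to the keyed approach used in this post, sketched here on a fresh unkeyed copy of the example data.

```r
library(data.table)

# Fresh copy of the example data with no key set.
DT <- data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)

# Join on x and y by naming them in on=; the table stays unkeyed and unsorted.
hit <- DT[.("a", 3), on = c("x", "y")]

# Not-join: every row except the one where x == "b" and y == 3.
rest <- DT[!.("b", 3), on = c("x", "y")]

print(hit)
print(rest)
```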
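4. Grouped aggregation
Grouped aggregation applies simple operations within the groups defined by a column, and is done with the by argument. The by argument and the key do not affect each other.
Aggregating by a single variable
Default result names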
> DT[,sum(v),by=x]
x V1
1: a 6
2: b 15
3: c 24
> DT[,sum(v),by=y]
y V1
1: 1 12
2: 3 15
3: 6 18
>
Custom result names
> DT[,list(sum.v.x = sum(v)),by=x]
x sum.v.x
1: a 6
2: b 15
3: c 24
> DT[,list(sum.v.y = sum(v)),by=y]
y sum.v.y
1: 1 12
2: 3 15
3: 6 18
> DT[,sum.v.y := sum(v) ,by=y]
>
Matching the aggregated result back to the original data
> DT[,sum.v.y := sum(v) ,by=y][]
x y v a sum.v.y
1: a 1 1 k 12
2: a 3 2 k 15
3: a 6 3 k 18
4: b 1 4 k 12
5: b 3 5 k 15
6: b 6 6 k 18
7: c 1 7 k 12
8: c 3 8 k 15
9: c 6 9 k 18
>
Multiple summaries by multiple grouping variables
Default result names
> DT[,list(mean(v),sum(v)),by=list(x,y)] # keyed by
x y V1 V2
1: a 1 1 1
2: a 3 2 2
3: a 6 3 3
4: b 1 4 4
5: b 3 5 5
6: b 6 6 6
7: c 1 7 7
8: c 3 8 8
9: c 6 9 9
>
Custom result names
> DT[,list(mean.v = mean(v),sum.v = sum(v)),by=list(x,y)] # keyed by
x y mean.v sum.v
1: a 1 1 1
2: a 3 2 2
3: a 6 3 3
4: b 1 4 4
5: b 3 5 5
6: b 6 6 6
7: c 1 7 7
8: c 3 8 8
9: c 6 9 9
>
Matching the aggregated result back to the original data
> DT[,c("mean.v", "sum.v.y") := list(mean(v),sum(v)) ,by=list(x,y)][]
x y v a sum.v.y mean.v
1: a 1 1 k 1 1
2: a 3 2 k 2 2
3: a 6 3 k 3 3
4: b 1 4 k 4 4
5: b 3 5 k 5 5
6: b 6 6 k 6 6
7: c 1 7 k 7 7
8: c 3 8 k 8 8
9: c 6 9 k 9 9
>
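Two related features are worth a short sketch: the special symbol .N counts the rows in each group, and keyby works like by but also sorts the result by the grouping columns and sets them as its key. The example data here is deliberately unsorted and is an illustrative assumption.

```r
library(data.table)

# Groups appear in the order c, a, b.
DT3 <- data.table(x = rep(c("c", "a", "b"), each = 3), v = 1:9)

# by= keeps groups in order of first appearance; .N is the group size.
counts <- DT3[, .N, by = x]

# keyby= returns the groups sorted and keys the result by x.
totals <- DT3[, .(total = sum(v)), keyby = x]

print(counts)
print(totals)
```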
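Converting between data.table and data.frame
The data.table format speeds up processing, while data.frame is the more basic type. The two can be converted with data.table(), setDT(), and setDF(): data.table() and setDT() convert a data.frame to a data.table, and setDF() converts a data.table to a data.frame. Note that setDT() and setDF() convert by reference, so the class of the object passed in changes as well, whereas data.table() returns a new object and leaves its argument unchanged.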
> class(DT)
[1] "data.table" "data.frame"
> class(setDF(DT))
[1] "data.frame"
> class(DT)
[1] "data.frame"
>
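A minimal sketch contrasting the copying conversion with the by-reference ones. The data frame df and its columns are illustrative.

```r
library(data.table)

df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

# data.table() builds a new object; df itself is left as a data.frame.
dt_copy <- data.table(df)
class_after_copy <- class(df)

# setDT() converts df in place, so no assignment is needed.
setDT(df)
class_after_setDT <- class(df)

# setDF() converts it back in place.
setDF(df)
class_after_setDF <- class(df)

print(class_after_copy)
print(class_after_setDT)
print(class_after_setDF)
```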
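In addition, the data.table package can be used together with the base functions duplicated(), unique(), and subset(). It also provides replacements for some base functions: rbindlist(), an upgraded rbind() that can combine data with different numbers of columns and different column order, and setorder(), a sorting function that is faster than arrange() in the dplyr package.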
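A minimal sketch of rbindlist() and setorder() on two small tables whose columns differ in number and order. The tables and column names are illustrative assumptions; use.names = TRUE matches columns by name and fill = TRUE pads missing columns with NA.

```r
library(data.table)

first  <- data.table(id = 1:2, score = c(10, 20))
second <- data.table(score = 30, id = 3L, note = "late")

# Match columns by name and fill the missing note column with NA.
combined <- rbindlist(list(first, second), use.names = TRUE, fill = TRUE)

# Sort in place by descending score; a leading minus means descending.
setorder(combined, -score)

print(combined)
```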