[Data Processing] R: An Introduction to data.table with Examples


Compared with the dplyr package, data.table can often process data considerably faster. This post gives a brief introduction to using the data.table package.

data.table: a package for fast processing of large data sets.

1. Reading data

The function for reading data in the data.table package is fread().
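
As a minimal sketch of reading data with fread(), the example below first writes a tiny CSV to a temporary file so that it is self-contained. The file contents and the names csv_path and scores are made up for illustration and do not come from the original post.

```r
library(data.table)

# Write a small CSV to a temporary file (hypothetical data for illustration).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,name,score", "1,a,3.5", "2,b,4.0", "3,c,2.5"), csv_path)

# fread() detects the separator and column types and returns a data.table.
scores <- fread(csv_path)
print(scores)
print(class(scores))
```

For a real file, the temporary path would simply be replaced by the path to the data.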

2. Creating a data.table

> library(data.table)
> DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
> DT
   x y v
1: a 1 1
2: a 3 2
3: a 6 3
4: b 1 4
5: b 3 5
6: b 6 6
7: c 1 7
8: c 3 8
9: c 6 9
> 

3. Basic operations

(1) Row extraction

Row extraction covers single-row and multi-row extraction.

Single-row extraction

> DT[2]                      # 2nd row
   x y v
1: a 3 2

> DT[2,]                     # same
   x y v
1: a 3 2
> 

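Here DT[2] and DT[2,] are exactly equivalent. The trailing "," only indicates that further arguments could be supplied, and the omitted ones take their default values. The trailing "," is left out in the rest of this post.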

Multi-row extraction

By row number

> DT[1:2]
   x y v
1: a 1 1
2: a 3 2

> DT[c(2,5)]
   x y v
1: a 3 2
2: b 3 5
> 

(2) Logical extraction

> DT[c(FALSE,TRUE)]          # even rows (usual recycling)
   x y v
1: a 3 2
2: b 1 4
3: b 6 6
4: c 3 8

Here c(FALSE,TRUE) is recycled automatically into a vector with as many elements as DT has rows.
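
Newer data.table releases reject a logical i whose length differs from the number of rows, so the call above may raise an error depending on the installed version (this is an assumption about the reader's setup, not something stated in the original post). A version-independent sketch expands the pattern explicitly with rep(); the names keep and even_rows are illustrative.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)

# Expand the FALSE/TRUE pattern to one logical value per row before subsetting.
keep <- rep(c(FALSE, TRUE), length.out = nrow(DT))
even_rows <- DT[keep]
print(even_rows)
```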

(3) Column extraction

As with rows, column extraction covers single-column and multi-column extraction.

Single-column extraction

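By column number
When extracting by number, set the with argument to FALSE. In the session below with=TRUE happens to return the same column, but older data.table versions require with=FALSE here.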

> DT[,2,with=FALSE]          # 2nd column
   y
1: 1
2: 3
3: 6
4: 1
5: 3
6: 6
7: 1
8: 3
9: 6
> DT[,2,with=TRUE]          # 2nd column
   y
1: 1
2: 3
3: 6
4: 1
5: 3
6: 6
7: 1
8: 3
9: 6
> 

By column name


> DT[,list(v)]      # v column (as data.table)
   v
1: 1
2: 2
3: 3
4: 4
5: 5
6: 6
7: 7
8: 8
9: 9

(4) Renaming columns

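Column names can be changed with setnames(). It works by reference and appears to be somewhat faster than the names() and colnames() functions used to rename data.frame columns.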

> dt = data.table(a=1:2,b=3:4,c=5:6) # compare to data.table
> dt
   a b c
1: 1 3 5
2: 2 4 6
> try(tracemem(dt))                  # by reference, no deep or shallow copies
[1] "<000000001C613E10>"
> dt
   a b c
1: 1 3 5
2: 2 4 6
> setnames(dt,"b","B")               # by name, no match() needed (warning if "b" is missing)
> dt
   a B c
1: 1 3 5
2: 2 4 6
> setnames(dt,3,"C")                 # by position with warning if 3 > ncol(dt)
> dt
   a B C
1: 1 3 5
2: 2 4 6
> setnames(dt,2:3,c("D","E"))        # multiple
> dt
   a D E
1: 1 3 5
2: 2 4 6
> setnames(dt,c("a","E"),c("A","F")) # multiple by name (warning if either "a" or "E" is missing)
> dt
   A D F
1: 1 3 5
2: 2 4 6
> setnames(dt,c("X","Y","Z"))        # replace all (length of names must be == ncol(DT))  
> dt
   X Y Z
1: 1 3 5
2: 2 4 6
> 

Multi-column extraction

By column number
As with single-column extraction by number above, set the with argument to FALSE when extracting multiple columns.


> DT[,2:3,with=FALSE]
   y v
1: 1 1
2: 3 2
3: 6 3
4: 1 4
5: 3 5
6: 6 6
7: 1 7
8: 3 8
9: 6 9
> DT[,c(1,3),with=FALSE] 
   x v
1: a 1
2: a 2
3: a 3
4: b 4
5: b 5
6: b 6
7: c 7
8: c 8
9: c 9
> 
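
The with=FALSE form is also useful when the columns to extract are held in a character vector. The following is a small sketch under that assumption; the variable names cols, by_name, and by_pos are illustrative only.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)

# Hypothetical selection stored in a variable rather than typed into j.
cols <- c("x", "v")

# with = FALSE tells data.table to treat j as column names or positions.
by_name <- DT[, cols, with = FALSE]
by_pos  <- DT[, c(1, 3), with = FALSE]

# Both calls select the same two columns.
print(isTRUE(all.equal(by_name, by_pos)))
```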

By column name

> DT[,list(y, v)]
   y v
1: 1 1
2: 3 2
3: 6 3
4: 1 4
5: 3 5
6: 6 6
7: 1 7
8: 3 8
9: 6 9
> 

If columns are extracted by name without wrapping them in list(), the extraction still works, but the result is returned as a vector.

> DT[,v]                     # v column (as vector)
[1] 1 2 3 4 5 6 7 8 9
> DT[,c(v)]                  # same
[1] 1 2 3 4 5 6 7 8 9
> DT[, c(y, v)]
 [1] 1 3 6 1 3 6 1 3 6 1 2 3 4 5 6 7 8 9
> 

(5) Adding and removing columns

Adding columns

Adding a single column

> DT
   x y v
1: a 1 1
2: a 3 2
3: a 6 3
4: b 1 4
5: b 3 5
6: b 6 6
7: c 1 7
8: c 3 8
9: c 6 9
> DT[, a := 'k']
> DT
   x y v a
1: a 1 1 k
2: a 3 2 k
3: a 6 3 k
4: b 1 4 k
5: b 3 5 k
6: b 6 6 k
7: c 1 7 k
8: c 3 8 k
9: c 6 9 k
> DT[,c:=8]        # add a numeric column, 8 for all rows
> DT
   x y v a c
1: a 1 1 k 8
2: a 3 2 k 8
3: a 6 3 k 8
4: b 1 4 k 8
5: b 3 5 k 8
6: b 6 6 k 8
7: c 1 7 k 8
8: c 3 8 k 8
9: c 6 9 k 8
> DT[,d:=9L]       # add an integer column, 9L for all rows
> DT
   x y v a c d
1: a 1 1 k 8 9
2: a 3 2 k 8 9
3: a 6 3 k 8 9
4: b 1 4 k 8 9
5: b 3 5 k 8 9
6: b 6 6 k 8 9
7: c 1 7 k 8 9
8: c 3 8 k 8 9
9: c 6 9 k 8 9
> DT[2,d:=10L]     # subassign by reference to column d
> DT
   x y v a c  d
1: a 1 1 k 8  9
2: a 3 2 k 8 10
3: a 6 3 k 8  9
4: b 1 4 k 8  9
5: b 3 5 k 8  9
6: b 6 6 k 8  9
7: c 1 7 k 8  9
8: c 3 8 k 8  9
9: c 6 9 k 8  9
> DT[, e := d + 2]
> DT
   x y v a c  d  e
1: a 1 1 k 8  9 11
2: a 3 2 k 8 10 12
3: a 6 3 k 8  9 11
4: b 1 4 k 8  9 11
5: b 3 5 k 8  9 11
6: b 6 6 k 8  9 11
7: c 1 7 k 8  9 11
8: c 3 8 k 8  9 11
9: c 6 9 k 8  9 11
> 

If the column name being added already exists in the data, the assignment modifies that column instead.

Adding multiple columns

> DT[, c('f', 'g') := list( d + 1, c)]
> DT[, ':='( f =  d + 1, g = c)]          # same
> DT
   x y v a c  d  e  f g
1: a 1 1 k 8  9 11 10 8
2: a 3 2 k 8 10 12 11 8
3: a 6 3 k 8  9 11 10 8
4: b 1 4 k 8  9 11 10 8
5: b 3 5 k 8  9 11 10 8
6: b 6 6 k 8  9 11 10 8
7: c 1 7 k 8  9 11 10 8
8: c 3 8 k 8  9 11 10 8
9: c 6 9 k 8  9 11 10 8
> 

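Note that a new column can only be computed from columns that already exist, not from columns created in the same call. In this example g = c works, whereas g = f raises an error when f does not exist yet.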
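
One way around this restriction is to chain two := calls, so that the second call sees the column created by the first. The sketch below rebuilds a small table for the purpose; the g = f * 2 formula is an arbitrary illustration, not taken from the original post.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)
DT[, d := 9L]

# The first call creates f; the chained second call can then use f to build g.
DT[, f := d + 1][, g := f * 2]
print(DT)
```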

(6) Removing columns


> DT[,c:=NULL]     # remove column c
> DT
   x y v a  d  e  f g
1: a 1 1 k  9 11 10 8
2: a 3 2 k 10 12 11 8
3: a 6 3 k  9 11 10 8
4: b 1 4 k  9 11 10 8
5: b 3 5 k  9 11 10 8
6: b 6 6 k  9 11 10 8
7: c 1 7 k  9 11 10 8
8: c 3 8 k  9 11 10 8
9: c 6 9 k  9 11 10 8
> 
> DT[, c('d', 'e', 'f', 'g'):=NULL]     
> DT
   x y v a
1: a 1 1 k
2: a 3 2 k
3: a 6 3 k
4: b 1 4 k
5: b 3 5 k
6: b 6 6 k
7: c 1 7 k
8: c 3 8 k
9: c 6 9 k
> 

Simple operations on columns
Simple operations mainly include the sum, mean, variance, and standard deviation.

> DT[2:3,sum(v)]             # sum(v) over rows 2 and 3
[1] 5
> DT[2:3,mean(v)]            # mean(v) over rows 2 and 3
[1] 2.5
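
Variance and standard deviation follow the same pattern as sum and mean. The short sketch below uses the base functions var() and sd() on the same rows and also shows several statistics returned together; the names v_var, v_sd, and overall are illustrative.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)

# Sample variance and standard deviation of v over rows 2 and 3.
v_var <- DT[2:3, var(v)]
v_sd  <- DT[2:3, sd(v)]
print(c(v_var, v_sd))

# Several statistics at once, returned as a one-row data.table.
overall <- DT[, list(total = sum(v), average = mean(v))]
print(overall)
```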

(7) Keys

Viewing and setting keys

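A key is defined on columns. Once a key is set, the data are automatically re-sorted by the key values, so each data.table can have at most one key. That key may, however, consist of several columns, which can be numeric, factor, character, or of other types.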

Setting a single-column key

> ## first method
> key(DT)                    # key
[1] "y"
> setkey(DT,x)               # set a 1-column key. No quotes, for convenience.
> key(DT)
[1] "x"
> DT
   x y v a
1: a 1 1 k
2: a 3 2 k
3: a 6 3 k
4: b 1 4 k
5: b 3 5 k
6: b 6 6 k
7: c 1 7 k
8: c 3 8 k
9: c 6 9 k
> ## second method
> setkeyv(DT,"y")            # character form (v in setkeyv stands for vector)
> key(DT)
[1] "y"
> 

Once a new key is set, the previous key is discarded.
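
Because DT is already sorted, the re-sorting is not visible in the session above. The following sketch uses a small unsorted table, made up for illustration, to show both the automatic sort and the replacement of an old key.

```r
library(data.table)

# Hypothetical unsorted table, used only for this illustration.
unsorted <- data.table(id = c("c", "a", "b"), value = c(30, 10, 20))
print(key(unsorted))       # NULL: no key has been set yet

# Setting the key sorts the rows by id, by reference.
setkey(unsorted, id)
print(unsorted)

# Setting a different key replaces the previous one.
setkey(unsorted, value)
print(key(unsorted))
```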

Setting a multi-column key

> ## first method
> setkey(DT,x,v)               # set a 2-column key. No quotes, for convenience.
> key(DT)
[1] "x" "v"
> DT
   x y v a
1: a 1 1 k
2: a 3 2 k
3: a 6 3 k
4: b 1 4 k
5: b 3 5 k
6: b 6 6 k
7: c 1 7 k
8: c 3 8 k
9: c 6 9 k
> ## second method
> setkeyv(DT,c("x", "y"))           # character-vector form; the key is now x, y
> key(DT)
[1] "x" "y"
> DT
   x y v a
1: a 1 1 k
2: a 3 2 k
3: a 6 3 k
4: b 1 4 k
5: b 3 5 k
6: b 6 6 k
7: c 1 7 k
8: c 3 8 k
9: c 6 9 k
> 

Extracting data via keys

Extracting data by key can speed up the lookup.

Single-column key

Selecting matches

> setkey(DT, x)
> DT["a"]                    # binary search (fast)
   x y v a
1: a 1 1 k
2: a 3 2 k
3: a 6 3 k
> DT[.("a")]                 # same; i.e. binary search (fast)
   x y v a
1: a 1 1 k
2: a 3 2 k
3: a 6 3 k
> DT[x=="a"]                 # same; i.e. binary search (fast)
   x y v a
1: a 1 1 k
2: a 3 2 k
3: a 6 3 k
> 

Excluding matches

> DT[!.("a")]                # not join
   x y v a
1: b 1 4 k
2: b 3 5 k
3: b 6 6 k
4: c 1 7 k
5: c 3 8 k
6: c 6 9 k
> DT[!"a"]                   # same
   x y v a
1: b 1 4 k
2: b 3 5 k
3: b 6 6 k
4: c 1 7 k
5: c 3 8 k
6: c 6 9 k
> DT[!2:4]                   # all rows other than 2:4
   x y v a
1: a 1 1 k
2: b 3 5 k
3: b 6 6 k
4: c 1 7 k
5: c 3 8 k
6: c 6 9 k
> 

Multi-column key

Selecting matches

> setkey(DT, x, y)
> ## Method First
> DT["a"]                    # join to 1st column of key
   x y v a
1: a 1 1 k
2: a 3 2 k
3: a 6 3 k
> DT[.("a")]                 # same, .() is an alias for list()
   x y v a
1: a 1 1 k
2: a 3 2 k
3: a 6 3 k
> DT[.("a",3)]               # join to 2 columns
   x y v a
1: a 3 2 k
> DT[.("a",3:6)]             # join 4 rows (2 missing)
   x y  v  a
1: a 3  2  k
2: a 4 NA NA
3: a 5 NA NA
4: a 6  3  k
> DT[.("a",3:6),nomatch=0]   # remove missing
   x y v a
1: a 3 2 k
2: a 6 3 k
> DT[.("a",3:6),roll=TRUE]   # rolling join (locf)
   x y v a
1: a 3 2 k
2: a 4 2 k
3: a 5 2 k
4: a 6 3 k
> ## Method Second
> DT[J('a')]
   x y v a
1: a 1 1 k
2: a 3 2 k
3: a 6 3 k
> DT[J("a",3)]               # binary search (fast)
   x y v a
1: a 3 2 k
> DT[J("a",3:6)]              # same; i.e. binary search (fast)
   x y  v  a
1: a 3  2  k
2: a 4 NA NA
3: a 5 NA NA
4: a 6  3  k
> DT[J("a",3:6), nomatch = 0]
   x y v a
1: a 3 2 k
2: a 6 3 k
> DT[J("a",3:6), roll = T]
   x y v a
1: a 3 2 k
2: a 4 2 k
3: a 5 2 k
4: a 6 3 k
> ## Method Third
> DT[list("a")]
   x y v a
1: a 1 1 k
2: a 3 2 k
3: a 6 3 k
> DT[list("a",3)]
   x y v a
1: a 3 2 k
> DT[list("a", 3:6)]
   x y  v  a
1: a 3  2  k
2: a 4 NA NA
3: a 5 NA NA
4: a 6  3  k
> DT[list("a", 3:6), nomatch = 0]
   x y v a
1: a 3 2 k
2: a 6 3 k
> DT[list("a", 3:6), roll = T]
   x y v a
1: a 3 2 k
2: a 4 2 k
3: a 5 2 k
4: a 6 3 k
> 

Excluding matches

> DT[x!="b" | y!=3]          # not yet optimized, currently vector scans
   x y v a
1: a 1 1 k
2: a 3 2 k
3: a 6 3 k
4: b 1 4 k
5: b 6 6 k
6: c 1 7 k
7: c 3 8 k
8: c 6 9 k
> DT[!.("b",3)]              # same result but much faster
   x y v a
1: a 1 1 k
2: a 3 2 k
3: a 6 3 k
4: b 1 4 k
5: b 6 6 k
6: c 1 7 k
7: c 3 8 k
8: c 6 9 k
> 

4. Grouped aggregation

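Grouped aggregation means applying simple operations within the groups defined by a column, and it is done with the by argument. Note that the by argument and keys do not affect each other here.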

Aggregation by a single grouping column

Default result names

> DT[,sum(v),by=x]
   x V1
1: a  6
2: b 15
3: c 24
> DT[,sum(v),by=y] 
   y V1
1: 1 12
2: 3 15
3: 6 18
>   

Custom result names

> DT[,list(sum.v.x = sum(v)),by=x]
   x sum.v.x
1: a       6
2: b      15
3: c      24
> DT[,list(sum.v.y = sum(v)),by=y] 
   y sum.v.y
1: 1      12
2: 3      15
3: 6      18
> DT[,sum.v.y := sum(v) ,by=y]
> 

Matching the aggregated results back to the original data

> DT[,sum.v.y := sum(v) ,by=y][]
   x y v a sum.v.y
1: a 1 1 k      12
2: a 3 2 k      15
3: a 6 3 k      18
4: b 1 4 k      12
5: b 3 5 k      15
6: b 6 6 k      18
7: c 1 7 k      12
8: c 3 8 k      15
9: c 6 9 k      18
> 

Multiple statistics by multiple grouping columns

Default result names

> DT[,list(mean(v),sum(v)),by=list(x,y)]   # keyed by
   x y V1 V2
1: a 1  1  1
2: a 3  2  2
3: a 6  3  3
4: b 1  4  4
5: b 3  5  5
6: b 6  6  6
7: c 1  7  7
8: c 3  8  8
9: c 6  9  9
> 

Custom result names

> DT[,list(mean.v = mean(v),sum.v = sum(v)),by=list(x,y)]   # keyed by
   x y mean.v sum.v
1: a 1      1     1
2: a 3      2     2
3: a 6      3     3
4: b 1      4     4
5: b 3      5     5
6: b 6      6     6
7: c 1      7     7
8: c 3      8     8
9: c 6      9     9
> 

Matching the aggregated results back to the original data

> DT[,c("mean.v", "sum.v.y") := list(mean(v),sum(v)) ,by=list(x,y)][]
   x y v a sum.v.y mean.v
1: a 1 1 k       1      1
2: a 3 2 k       2      2
3: a 6 3 k       3      3
4: b 1 4 k       4      4
5: b 3 5 k       5      5
6: b 6 6 k       6      6
7: c 1 7 k       7      7
8: c 3 8 k       8      8
9: c 6 9 k       9      9
> 

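Converting between data.table and data.frame
The data.table format speeds up processing, while data.frame is the more basic structure. Conversion between the two is done with data.table(), setDT(), and setDF(): data.table() and setDT() convert a data.frame to a data.table, and setDF() converts a data.table to a data.frame. Note that setDT() and setDF() convert by reference, so the class of the object passed in changes as well, whereas data.table() returns a new object and leaves its argument unchanged.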

> class(DT)
[1] "data.table" "data.frame"
> class(setDF(DT))
[1] "data.frame"
> class(DT)
[1] "data.frame"
> 
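
The difference between converting a copy and converting in place can be sketched as follows. The data frame df and the class_after_* variables are hypothetical names introduced only for this illustration.

```r
library(data.table)

df <- data.frame(id = 1:3, value = c(2.5, 3.5, 4.5))

# data.table() returns a new object; df itself stays a data.frame.
copy_dt <- data.table(df)
class_after_copy <- class(df)

# setDT() converts df in place to a data.table.
setDT(df)
class_after_setDT <- class(df)

# setDF() converts it back in place to a data.frame.
setDF(df)
class_after_setDF <- class(df)

print(list(class_after_copy, class_after_setDT, class_after_setDF))
```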

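In addition, data.table works together with the base functions duplicated(), unique(), and subset(). Beyond that, the package provides replacements for some base functions: rbindlist(), an enhanced version of rbind(), can combine data with different numbers of columns and different column orders, and setorder() is a sorting function that is faster than arrange() in the dplyr package.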
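
As a rough sketch of these two functions, the example below combines two small tables whose columns differ in number and order, then sorts the result by reference. The tables first and second and their contents are made up for illustration.

```r
library(data.table)

# Two hypothetical tables with different columns and column order.
first  <- data.table(id = 1:2, score = c(10, 20))
second <- data.table(score = 5, id = 3L, note = "late")

# use.names matches columns by name; fill = TRUE pads missing columns with NA.
combined <- rbindlist(list(first, second), use.names = TRUE, fill = TRUE)

# setorder() sorts by reference; a leading minus sorts in descending order.
setorder(combined, -score)
print(combined)
```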

