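Reading and Writing Data
For discrete variables, we look at the frequency of each level, cross-tabulate two variables, or draw a bar chart.
For continuous variables, we look at the mean, standard deviation, quantiles, and so on.
In addition, functions such as summary(), str(), and describe() (from the psych package) summarize a whole data frame.
These are the most basic methods, but they are not very flexible and their output is fixed; to go further we need to aggregate and reshape the data.
Before turning to aggregation and reshaping, we introduce tibble, a newer object that can replace the data frame, and readr, a package for reading datasets efficiently. We close with two packages for reshaping data: reshape2 and tidyr.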
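As a quick illustration of the basic summaries mentioned above, here is a minimal sketch in base R. The data frame toy and its values are made up for this example and are not part of the SegData set used later.

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  gender = c("Female", "Male", "Female", "Male", "Female"),
  house  = c("Yes", "Yes", "No", "No", "Yes"),
  income = c(120, 95, 80, 110, 100)
)

# Discrete variables: frequency of each level and a cross-tabulation
gender_counts <- table(toy$gender)
cross_tab <- table(toy$gender, toy$house)

# Continuous variable: mean, standard deviation, and quartiles
income_mean <- mean(toy$income)
income_sd <- sd(toy$income)
income_q <- quantile(toy$income, probs = c(0.25, 0.5, 0.75))

# Overview of the whole data frame
summary(toy)
str(toy)
```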
The tibble object: a replacement for the traditional data frame
> library(tibble)
> library(ggplot2)
> sim.dat=read.csv("https://raw.githubusercontent.com/happyrabbit/DataScientistR/master/Data/SegData.csv")
> df=data.frame(x=c(1:5),y=rep("a",5))
> as_tibble(df)
# A tibble: 5 x 2
x y
<int> <fctr>
1 1 a
2 2 a
3 3 a
4 4 a
5 5 a
> tibble(x=1:5,y=rep("a",5))
# A tibble: 5 x 2
x y
<int> <chr>
1 1 a
2 2 a
3 3 a
4 4 a
5 5 a
>
> tibble(x=1:5,y=1,z=x^2+y)
# A tibble: 5 x 3
x y z
<int> <dbl> <dbl>
1 1 1 2
2 2 1 5
3 3 1 10
4 4 1 17
5 5 1 26
> tb=tibble(':)'="smile",' '="space",'2000'="number")
> print(tb)
# A tibble: 1 x 3
`:)` ` ` `2000`
<chr> <chr> <chr>
1 smile space number
>
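As shown above, a tibble may have column names that are not valid R names. To refer to such variables, including from other packages, wrap the name in backticks.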
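A minimal sketch of referring to non-syntactic column names follows. The object names smile and number are chosen only for this illustration.

```r
library(tibble)

# Non-syntactic column names are allowed in a tibble
tb <- tibble(`:)` = "smile", `2000` = "number")

# Refer to such a column with backticks ...
smile <- tb$`:)`
# ... or pass the name as a string to [[
number <- tb[["2000"]]
```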
A tibble differs from a traditional data frame mainly in two respects: printing and subsetting.
1. Printing
> print(as_tibble(sim.dat))
# A tibble: 1,000 x 19
age gender income house store_exp online_exp store_trans online_trans
<int> <fctr> <dbl> <fctr> <dbl> <dbl> <int> <int>
1 57 Female 120963.4 Yes 529.1344 303.5125 2 2
2 63 Female 122008.1 Yes 478.0058 109.5297 4 2
3 59 Male 114202.3 Yes 490.8107 279.2496 7 2
4 60 Male 113616.3 Yes 347.8090 141.6698 10 2
5 51 Male 124252.6 Yes 379.6259 112.2372 4 4
6 59 Male 107661.5 Yes 338.3154 195.6870 4 5
7 57 Male 120483.3 Yes 482.5445 284.5363 5 3
8 57 Male 110542.0 Yes 340.7368 135.2556 11 5
9 61 Female 132060.5 Yes 608.2310 142.5503 6 1
10 60 Male 105048.8 Yes 470.3190 163.4663 12 1
# ... with 990 more rows, and 11 more variables: Q1 <int>, Q2 <int>, Q3 <int>,
# Q4 <int>, Q5 <int>, Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>,
# segment <fctr>
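As shown above, only the first 10 rows are printed, the number of columns shown adapts to the width of the screen, and each column's type appears under its name, which makes the output easier to read.
2. Subsetting
To extract a single variable from a tibble, use "$" or "[[".
"[[" can extract by variable name or by position; "$" can extract only by name.
With the pipe operator "%>%", the same extraction is written using the "." placeholder.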
sim.dat$age
sim.dat[["age"]]
sim.dat[[1]]
library(dplyr)
sim.dat%>%.$age
sim.dat%>%.[["age"]]
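When you subset a data frame with "[", the result is not always a data frame (selecting a single column returns a vector by default), which can cause errors further down a program. Subsetting a tibble with "[" always returns a tibble. With "$" or "[[", both a data frame and a tibble return the variable as a vector.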
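The subsetting behaviour of data frames and tibbles can be compared directly. This is a small sketch on made-up data; the names df, tb, df_col, tb_col, and tb_vec are only for illustration.

```r
library(tibble)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

# Single-column "[" on a data frame drops to a plain vector by default
df_col <- df[, "x"]

# The same call on a tibble keeps the tibble structure
tb_col <- tb[, "x"]

# "[[" returns the underlying vector for a tibble as well
tb_vec <- tb[["x"]]
```

If code downstream expects a data frame, the tibble result is the safer one; base data frames offer the same behaviour with df[, "x", drop = FALSE].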
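Note: because tibble is a relatively new object and some older modelling functions may not handle it, you can convert a tibble back to a traditional data frame after cleaning the data and before modelling.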
sim.dat=as.data.frame(sim.dat)
class(sim.dat)
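Efficient data reading and writing: the readr package
Functions in readr for reading data:
read_csv() reads comma-delimited files
read_csv2() reads semicolon-delimited files
read_tsv() reads tab-delimited files
read_delim() reads files with any delimiter
Of these, read_csv() covers most data-reading needs.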
# skip=2 skips the first two lines
> dat=read_csv("This line is sample data
+ This line is just a comment
+ x,y,z
+ 1,2,3",skip=2)
> print(dat)
# A tibble: 1 x 3
x y z
<int> <int> <int>
1 1 2 3
> dat=read_csv("1,2,3\n4,5,6",col_names=FALSE)
> print(dat)
# A tibble: 2 x 3
X1 X2 X3
<int> <int> <int>
1 1 2 3
2 4 5 6
For semicolon-delimited files, use read_csv2():
> dat=read_csv2("x;y;z\n1;2;3")
> print(dat)
# A tibble: 1 x 3
x y z
<int> <int> <int>
1 1 2 3
For tab-delimited files, use read_tsv():
> dat1=read_tsv("x\ty\tz\n1\t2\t3")
> print(dat1)
# A tibble: 1 x 3
x y z
<int> <int> <int>
1 1 2 3
For any other delimiter, use read_delim():
> dat2=read_delim("x|y|z\n1|2|3",delim=
+ "|")
> print(dat2)
# A tibble: 1 x 3
x y z
<int> <int> <int>
1 1 2 3
>
Specifying missing values:
> dat=read_csv("x,y,z\n1,2,99",na="99")
> print(dat)
# A tibble: 1 x 3
x y z
<int> <int> <chr>
1 1 2 <NA>
>
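readr also provides two functions for writing data, write_csv() and write_tsv(). Their advantages are:
1. Strings are always encoded as UTF-8.
2. Dates and times are stored in ISO 8601 format, which other software can parse easily.
There is also write_excel_csv(), which writes a .csv file with a UTF-8 byte order mark so that Excel reads the encoding correctly.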
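A minimal round trip with write_csv() might look like the sketch below. The data and the temporary file path are made up for this illustration.

```r
library(readr)

# Toy data with a date column (hypothetical values)
dat <- data.frame(
  id   = 1:2,
  name = c("a", "b"),
  day  = as.Date(c("2018-01-01", "2018-01-02"))
)

path <- tempfile(fileext = ".csv")  # hypothetical output location
write_csv(dat, path)

# Dates are written as ISO 8601 text ...
lines_written <- readLines(path)

# ... so read_csv() can parse them back into Date values
back <- suppressMessages(read_csv(path))
```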
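For other types of data, the following packages are available:
haven: reads SPSS, Stata, and SAS data
readxl: reads Excel files (.xls and .xlsx)
DBI: together with a database-specific backend (MySQL, etc.), reads data directly from a database through SQL.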
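Data table objects (data.table):
We can index and search data with square brackets.
Simple aggregation can also be done with functions such as tapply(), aggregate(), and table().
Square brackets make subsetting a data frame easy, but further aggregation needs help from other packages; it would be much more convenient if aggregation could be done inside the brackets as well. data.table does exactly that:
1. It handles large datasets more efficiently.
2. It is as simple to work with as a data frame.
3. It subsets, groups, and merges data quickly.
4. A data frame can easily be converted into a data table.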
> library(data.table)
> dt=data.table(sim.dat)
# Note: the operations below do not work on a traditional data frame
> dt[,mean(online_trans)]
[1] 13.546
> dt[,mean(online_trans),by=gender]
gender V1
1: Female 15.38448
2: Male 11.26233
> dt[,mean(online_trans),by=.(gender,house)]
gender house V1
1: Female Yes 11.312030
2: Male Yes 8.771523
3: Female No 19.145833
4: Male No 16.486111
> dt[,.(avg=mean(online_trans)),by=.(gender,house)]
gender house avg
1: Female Yes 11.312030
2: Male Yes 8.771523
3: Female No 19.145833
4: Male No 16.486111
Data table operations resemble SQL.
For example: select gender,avg(online_trans) from sim.dat group by gender
is equivalent to
> dt[,mean(online_trans),by=gender]
gender V1
1: Female 15.38448
2: Male 11.26233
>
select gender,house,avg(online_trans) as avg from sim.dat group by gender,house
is equivalent to
> dt[,.(avg=mean(online_trans)),by=.(gender,house)]
gender house avg
1: Female Yes 11.312030
2: Male Yes 8.771523
3: Female No 19.145833
4: Male No 16.486111
>
select gender,house,avg(online_trans) as avg from sim.dat where age<40 group by gender,house
is equivalent to
> dt[age<40,.(avg=mean(online_trans)),by=.(gender,house)]
gender house avg
1: Male Yes 14.45977
2: Female Yes 18.14062
3: Male No 18.24299
4: Female No 20.10196
Selecting rows:
> dt[age<20&income>80000]
age gender income house store_exp online_exp store_trans online_trans Q1 Q2
1: 19 Female 83534.70 No 227.6686 1490.719 1 22 2 1
2: 18 Female 89415.97 Yes 209.5487 1926.470 3 28 2 1
3: 19 Female 92812.81 No 186.7475 1041.539 2 18 3 1
Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 segment
1: 1 2 4 1 4 2 4 1 Style
2: 1 1 4 1 4 2 4 1 Style
3: 1 2 4 1 4 3 4 1 Style
> dt[1:2]
age gender income house store_exp online_exp store_trans online_trans Q1 Q2
1: 57 Female 120963.4 Yes 529.1344 303.5125 2 2 4 2
2: 63 Female 122008.1 Yes 478.0058 109.5297 4 2 4 1
Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 segment
1: 1 2 1 4 1 4 2 4 Price
2: 1 2 1 4 1 4 1 4 Price
>
Selecting columns:
> ans=dt[,age]
> head(ans)
[1] 57 63 59 60 51 59
> ans=dt[,.(age,online_exp)]
> head(ans)
age online_exp
1: 57 303.5125
2: 63 109.5297
3: 59 279.2496
4: 60 141.6698
5: 51 112.2372
6: 59 195.6870
> ans=dt[,age:income,with=FALSE]
> head(ans,2)
age gender income
1: 57 Female 120963.4
2: 63 Female 122008.1
# Drop columns; "-" can be replaced with "!"
> ans=dt[,-(age:online_exp),with=FALSE]
Counting:
> dt[,.N]
[1] 1000
> dt[,.N,by=gender]
gender N
1: Female 554
2: Male 446
> dt[age<30,.(count=.N),by=gender]
gender count
1: Female 292
2: Male 86
Sorting:
> head(dt[order(-online_exp)],5)
age gender income house store_exp online_exp store_trans online_trans Q1 Q2
1: 40 Female 217599.7 No 7023.684 9479.442 10 6 1 4
2: 41 Female NA Yes 3786.740 8638.239 14 10 1 4
3: 36 Male 228550.1 Yes 3279.621 8220.555 8 12 1 4
4: 31 Female 159508.1 Yes 5177.081 8005.932 11 13 1 4
5: 43 Female 190407.4 Yes 4694.922 7875.562 6 11 1 4
Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 segment
1: 5 4 3 4 4 1 4 2 Conspicuous
2: 4 4 4 4 4 1 4 2 Conspicuous
3: 5 4 4 4 4 1 4 1 Conspicuous
4: 4 4 4 4 4 1 4 2 Conspicuous
5: 5 4 4 4 4 1 4 2 Conspicuous
> dt[order(-online_exp)][1:5]
age gender income house store_exp online_exp store_trans online_trans Q1 Q2
1: 40 Female 217599.7 No 7023.684 9479.442 10 6 1 4
2: 41 Female NA Yes 3786.740 8638.239 14 10 1 4
3: 36 Male 228550.1 Yes 3279.621 8220.555 8 12 1 4
4: 31 Female 159508.1 Yes 5177.081 8005.932 11 13 1 4
5: 43 Female 190407.4 Yes 4694.922 7875.562 6 11 1 4
Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 segment
1: 5 4 3 4 4 1 4 2 Conspicuous
2: 4 4 4 4 4 1 4 2 Conspicuous
3: 5 4 4 4 4 1 4 1 Conspicuous
4: 4 4 4 4 4 1 4 2 Conspicuous
5: 5 4 4 4 4 1 4 2 Conspicuous
> dt[order(gender,-online_exp)][1:5]
age gender income house store_exp online_exp store_trans online_trans Q1 Q2
1: 40 Female 217599.7 No 7023.684 9479.442 10 6 1 4
2: 41 Female NA Yes 3786.740 8638.239 14 10 1 4
3: 31 Female 159508.1 Yes 5177.081 8005.932 11 13 1 4
4: 43 Female 190407.4 Yes 4694.922 7875.562 6 11 1 4
5: 50 Female 263858.0 Yes 5813.802 7448.729 11 11 1 4
Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 segment
1: 5 4 3 4 4 1 4 2 Conspicuous
2: 4 4 4 4 4 1 4 2 Conspicuous
3: 4 4 4 4 4 1 4 2 Conspicuous
4: 5 4 4 4 4 1 4 2 Conspicuous
5: 5 4 4 4 4 1 4 1 Conspicuous
>
Reading data with fread()
fread() in data.table is usually even faster than read_csv(), especially on large files.
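A minimal sketch of fread() follows. It writes a tiny, made-up file to a temporary path and reads it back; it does not attempt to measure speed.

```r
library(data.table)

# Write a small toy file to a temporary (hypothetical) location
path <- tempfile(fileext = ".csv")
fwrite(data.frame(x = 1:3, y = c("a", "b", "c")), path)

# fread() detects the separator and column types automatically
dt_small <- fread(path)

# The result is a data.table, so bracket syntax works directly
x_total <- dt_small[, sum(x)]
```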
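Data aggregation
base package: apply(), lapply(), sapply(), etc.
> # keep only the numeric columns
> sdat=sim.dat[,sapply(sim.dat,is.numeric)]
> # class of every column (apply() would first coerce the data frame to a matrix)
> sapply(sim.dat,class)
> # column means and standard deviations, ignoring missing values
> apply(sdat,MARGIN=2,function(x) mean(na.omit(x)))
> apply(sdat,MARGIN=2,function(x) sd(na.omit(x)))
plyr package: ddply()
# The result is returned as a data frame
> library(plyr)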
> ddply(sim.dat,"segment",summarize,avg_online=round(sum(online_exp)/sum(online_trans),2),avg_store=round(sum(store_exp)/sum(store_trans),2))
segment avg_online avg_store
1 Conspicuous 442.27 479.25
2 Price 69.28 81.30
3 Quality 126.05 105.12
4 Style 92.83 121.07
>
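dplyr package (designed for working with data frames). Its main functions:
1. Displaying data frames
2. Subsetting data
3. Summarizing data
4. Creating new variables
5. Merging datasets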
> tibble::as_tibble(sim.dat)  # replaces dplyr::tbl_df(), which is deprecated in current dplyr
# A tibble: 1,000 x 19
age gender income house store_exp online_exp store_trans online_trans
<int> <chr> <dbl> <chr> <dbl> <dbl> <int> <int>
1 57 Female 120963.4 Yes 529.1344 303.5125 2 2
2 63 Female 122008.1 Yes 478.0058 109.5297 4 2
3 59 Male 114202.3 Yes 490.8107 279.2496 7 2
4 60 Male 113616.3 Yes 347.8090 141.6698 10 2
5 51 Male 124252.6 Yes 379.6259 112.2372 4 4
6 59 Male 107661.5 Yes 338.3154 195.6870 4 5
7 57 Male 120483.3 Yes 482.5445 284.5363 5 3
8 57 Male 110542.0 Yes 340.7368 135.2556 11 5
9 61 Female 132060.5 Yes 608.2310 142.5503 6 1
10 60 Male 105048.8 Yes 470.3190 163.4663 12 1
# ... with 990 more rows, and 11 more variables: Q1 <int>, Q2 <int>, Q3 <int>,
# Q4 <int>, Q5 <int>, Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>,
# segment <chr>
> dplyr::glimpse(sim.dat)
Observations: 1,000
Variables: 19
$ age <int> 57, 63, 59, 60, 51, 59, 57, 57, 61, 60, 58, 59, 64, 57,...
$ gender <chr> "Female", "Female", "Male", "Male", "Male", "Male", "Ma...
$ income <dbl> 120963.4, 122008.1, 114202.3, 113616.3, 124252.6, 10766...
$ house <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
$ store_exp <dbl> 529.1344, 478.0058, 490.8107, 347.8090, 379.6259, 338.3...
$ online_exp <dbl> 303.5125, 109.5297, 279.2496, 141.6698, 112.2372, 195.6...
$ store_trans <int> 2, 4, 7, 10, 4, 4, 5, 11, 6, 12, 5, 6, 7, 7, 5, 5, 5, 5...
$ online_trans <int> 2, 2, 2, 2, 4, 5, 3, 5, 1, 1, 4, 2, 4, 3, 5, 1, 3, 2, 2...
$ Q1 <int> 4, 4, 5, 5, 4, 4, 4, 5, 4, 4, 4, 4, 5, 4, 4, 5, 5, 5, 4...
$ Q2 <int> 2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 2, 2...
$ Q3 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ Q4 <int> 2, 2, 2, 3, 3, 2, 2, 3, 2, 3, 3, 3, 3, 3, 2, 2, 2, 3, 3...
$ Q5 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ Q6 <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
$ Q7 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ Q8 <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
$ Q9 <int> 2, 1, 1, 2, 2, 1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2, 2...
$ Q10 <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
$ segment <chr> "Price", "Price", "Price", "Price", "Price", "Price", "...
Subsetting data (by row / column)
> library(magrittr)
> library(dplyr)
> dplyr::filter(sim.dat,income>300000) %>%
+ tibble::as_tibble()  # replaces the deprecated dplyr::tbl_df()
# A tibble: 4 x 19
age gender income house store_exp online_exp store_trans online_trans
<int> <chr> <dbl> <chr> <dbl> <dbl> <int> <int>
1 40 Male 301398.0 Yes 4840.461 3618.212 10 11
2 33 Male 319704.3 Yes 5998.305 4395.923 9 11
3 41 Male 317476.2 Yes 3029.844 4179.671 11 12
4 37 Female 315697.2 Yes 6548.970 4284.065 13 11
# ... with 11 more variables: Q1 <int>, Q2 <int>, Q3 <int>, Q4 <int>, Q5 <int>,
# Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>, segment <chr>
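In addition, distinct() removes duplicate rows from a data frame; sample_frac() randomly selects a given fraction of rows; sample_n() randomly selects a given number of rows; slice() selects rows by position; and top_n() selects the observations with the highest values of a variable.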
> dplyr::distinct(sim.dat)
# A tibble: 1,000 x 19
age gender income house store_exp online_exp store_trans online_trans
<int> <chr> <dbl> <chr> <dbl> <dbl> <int> <int>
1 57 Female 120963.4 Yes 529.1344 303.5125 2 2
2 63 Female 122008.1 Yes 478.0058 109.5297 4 2
3 59 Male 114202.3 Yes 490.8107 279.2496 7 2
4 60 Male 113616.3 Yes 347.8090 141.6698 10 2
5 51 Male 124252.6 Yes 379.6259 112.2372 4 4
6 59 Male 107661.5 Yes 338.3154 195.6870 4 5
7 57 Male 120483.3 Yes 482.5445 284.5363 5 3
8 57 Male 110542.0 Yes 340.7368 135.2556 11 5
9 61 Female 132060.5 Yes 608.2310 142.5503 6 1
10 60 Male 105048.8 Yes 470.3190 163.4663 12 1
# ... with 990 more rows, and 11 more variables: Q1 <int>, Q2 <int>, Q3 <int>,
# Q4 <int>, Q5 <int>, Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>,
# segment <chr>
> dplyr::sample_frac(sim.dat,0.05,replace=TRUE)
# A tibble: 50 x 19
age gender income house store_exp online_exp store_trans online_trans
<int> <chr> <dbl> <chr> <dbl> <dbl> <int> <int>
1 22 Male 91553.21 No 200.7210 1777.4974 4 27
2 34 Female 60521.76 No 299.3096 2054.1732 3 16
3 33 Male NA No 265.6550 1892.5581 2 12
4 38 Female 164506.62 Yes 3916.9309 5764.1235 11 10
5 26 Female 89461.40 No 200.4784 2449.7965 1 23
6 26 Female 105528.79 Yes 186.9383 2349.9275 5 17
7 55 Male 128194.20 Yes 595.6952 156.9314 6 2
8 35 Female 130108.64 Yes 6155.4803 6201.7090 9 13
9 36 Male NA Yes 203.3036 2202.5147 2 15
10 38 Male 267564.87 Yes 5335.1143 6052.4377 8 10
# ... with 40 more rows, and 11 more variables: Q1 <int>, Q2 <int>, Q3 <int>,
# Q4 <int>, Q5 <int>, Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>,
# segment <chr>
> dplyr::sample_n(sim.dat,10,replace=TRUE)
# A tibble: 10 x 19
age gender income house store_exp online_exp store_trans online_trans
<int> <chr> <dbl> <chr> <dbl> <dbl> <int> <int>
1 34 Female 73234.49 No 349.5491 2081.4476 4 21
2 25 Female 90856.12 No 203.7759 2228.4818 4 23
3 37 Male 187062.94 Yes 5931.7494 1942.1789 18 11
4 34 Male 53945.69 Yes 370.5065 2305.3430 3 14
5 23 Female 81763.92 No 205.6662 1040.8967 3 24
6 300 Male 208017.46 Yes 5076.8009 6053.4853 12 11
7 56 Male NA Yes 419.6702 192.3719 3 1
8 26 Female 95341.78 No 198.9729 2036.4738 3 21
9 26 Male 78240.93 No 430.2481 2091.4694 3 14
10 27 Female 90303.46 No 198.9020 1870.3866 6 13
# ... with 11 more variables: Q1 <int>, Q2 <int>, Q3 <int>, Q4 <int>, Q5 <int>,
# Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>, segment <chr>
> dplyr::top_n(sim.dat,2,income)
# A tibble: 2 x 19
age gender income house store_exp online_exp store_trans online_trans
<int> <chr> <dbl> <chr> <dbl> <dbl> <int> <int>
1 33 Male 319704.3 Yes 5998.305 4395.923 9 11
2 41 Male 317476.2 Yes 3029.844 4179.671 11 12
# ... with 11 more variables: Q1 <int>, Q2 <int>, Q3 <int>, Q4 <int>, Q5 <int>,
# Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>, segment <chr>
>
dplyr's select() function selects columns (code omitted).
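The omitted select() calls might look like the following sketch. The data frame toy is made up for this illustration; only the column names echo the survey data.

```r
library(dplyr)

# Toy data (hypothetical values)
toy <- data.frame(age = c(25, 40), income = c(50, 80),
                  Q1 = c(1, 4), Q2 = c(2, 3))

by_name   <- select(toy, age, income)        # pick columns by name
by_range  <- select(toy, age:Q1)             # a contiguous range of columns
by_helper <- select(toy, starts_with("Q"))   # columns whose names start with "Q"
dropped   <- select(toy, -income)            # everything except income
```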
Summarizing data (similar to apply() and ddply()):
> dplyr::summarise(sim.dat,avg_online=mean(online_trans))
# A tibble: 1 x 1
avg_online
<dbl>
1 13.546
Use group_by() to split the observations into groups by a categorical variable before summarizing.
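A grouped summary can be sketched as follows, on made-up data. The column names mirror the survey data, but the values are hypothetical.

```r
library(dplyr)

# Toy data (hypothetical values)
toy <- data.frame(
  segment      = c("Price", "Price", "Style", "Style"),
  online_trans = c(2, 4, 20, 30)
)

# One summary row per segment: the group mean and the group size
by_segment <- toy %>%
  group_by(segment) %>%
  summarise(avg_online = mean(online_trans), n = n())
```

This plays the same role as the by= argument in the data.table examples earlier.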
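Creating new variables
mutate() computes new columns from existing ones.
transmute() is similar to mutate(), but keeps only the newly created variables.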
> dplyr::mutate(sim.dat,total_exp=store_exp+online_exp)
# A tibble: 1,000 x 20
age gender income house store_exp online_exp store_trans online_trans
<int> <chr> <dbl> <chr> <dbl> <dbl> <int> <int>
1 57 Female 120963.4 Yes 529.1344 303.5125 2 2
2 63 Female 122008.1 Yes 478.0058 109.5297 4 2
3 59 Male 114202.3 Yes 490.8107 279.2496 7 2
4 60 Male 113616.3 Yes 347.8090 141.6698 10 2
5 51 Male 124252.6 Yes 379.6259 112.2372 4 4
6 59 Male 107661.5 Yes 338.3154 195.6870 4 5
7 57 Male 120483.3 Yes 482.5445 284.5363 5 3
8 57 Male 110542.0 Yes 340.7368 135.2556 11 5
9 61 Female 132060.5 Yes 608.2310 142.5503 6 1
10 60 Male 105048.8 Yes 470.3190 163.4663 12 1
# ... with 990 more rows, and 12 more variables: Q1 <int>, Q2 <int>, Q3 <int>,
# Q4 <int>, Q5 <int>, Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>,
# segment <chr>, total_exp <dbl>
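To contrast mutate() with transmute(), here is a small sketch on made-up data. The values are hypothetical.

```r
library(dplyr)

# Toy data (hypothetical values)
toy <- data.frame(store_exp = c(100, 200), online_exp = c(50, 25))

# mutate() keeps the existing columns and appends the new one
with_mutate <- mutate(toy, total_exp = store_exp + online_exp)

# transmute() returns only the newly created column
with_transmute <- transmute(toy, total_exp = store_exp + online_exp)
```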
Merging datasets
> x=data.frame(cbind(ID=c("A","B","C"),x1=c(1,2,3)))
> y=data.frame(cbind(ID=c("B","C","D"),y1=c(T,T,F)))
> x
ID x1
1 A 1
2 B 2
3 C 3
> y
ID y1
1 B TRUE
2 C TRUE
3 D FALSE
> left_join(x,y,by="ID")
ID x1 y1
1 A 1 <NA>
2 B 2 TRUE
3 C 3 TRUE
Warning message:
Column `ID` joining factors with different levels, coercing to character vector
> inner_join(x,y,by="ID")
ID x1 y1
1 B 2 TRUE
2 C 3 TRUE
Warning message:
Column `ID` joining factors with different levels, coercing to character vector
> full_join(x,y,by="ID")
ID x1 y1
1 A 1 <NA>
2 B 2 TRUE
3 C 3 TRUE
4 D <NA> FALSE
Warning message:
Column `ID` joining factors with different levels, coercing to character vector
> semi_join(x,y,by="ID")
ID x1
1 B 2
2 C 3
Warning message:
Column `ID` joining factors with different levels, coercing to character vector
> anti_join(x,y,by="ID")
ID x1
1 A 1
Warning message:
Column `ID` joining factors with different levels, coercing to character vector
>
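In addition, dplyr provides set operations on data frames, namely intersection, union, and set difference (intersect(), union(), setdiff()), as well as functions that append one data frame to another by row or by column (bind_rows(), bind_cols()).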
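The set operations and binding functions can be sketched as below. The data frames a and b are made up and simply reuse the ID values from the join example.

```r
library(dplyr)

a <- data.frame(ID = c("A", "B", "C"), stringsAsFactors = FALSE)
b <- data.frame(ID = c("B", "C", "D"), stringsAsFactors = FALSE)

both   <- dplyr::intersect(a, b)  # rows present in both a and b
either <- dplyr::union(a, b)      # distinct rows from a and b combined
only_a <- dplyr::setdiff(a, b)    # rows in a that are not in b

stacked <- bind_rows(a, b)                       # append b below a
side    <- bind_cols(a, data.frame(x1 = 1:3))    # append columns next to a
```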
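Reshaping data
reshape2 package
melt() first "melts" the data into long form, and dcast() then recasts it into the desired shape.
melt() can melt data frames, lists, matrices, tables, and so on.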
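A minimal melt-then-cast round trip might look like this sketch. The data frame wide and the new column names channel and amount are made up for the illustration.

```r
library(reshape2)

# Toy wide data: one row per id, one column per spending channel
wide <- data.frame(id = c(1, 2),
                   store_exp = c(100, 200),
                   online_exp = c(50, 25))

# melt(): stack the two spending columns into (channel, amount) pairs
long <- melt(wide, id.vars = "id",
             variable.name = "channel", value.name = "amount")

# dcast(): spread the channels back out into separate columns
wide_again <- dcast(long, id ~ channel, value.var = "amount")
```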
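tidyr package
gather() comes first; it is similar to melt().
spread() is the opposite of gather(): gather() stacks several columns into one, while spread() splits one column into several.
separate() and unite() are another complementary pair in tidyr: separate() splits one column into several, and unite() pastes several columns together, much like paste().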
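The four tidyr functions can be tried together on a small made-up data frame. The names wide, channel, amount, place, and measure are chosen only for this sketch.

```r
library(tidyr)

# Toy wide data (hypothetical values)
wide <- data.frame(id = c(1, 2),
                   store_exp = c(100, 200),
                   online_exp = c(50, 25))

# gather(): stack store_exp and online_exp into key/value pairs
long <- gather(wide, key = "channel", value = "amount", store_exp, online_exp)

# spread(): the inverse, one column per distinct key
wide_again <- spread(long, key = channel, value = amount)

# separate(): split "store_exp" into "store" and "exp"
parts <- separate(long, channel, into = c("place", "measure"), sep = "_")

# unite(): paste the two pieces back together
joined <- unite(parts, "channel", place, measure, sep = "_")
```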