Data Manipulation in R

Reading and Writing Data

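For discrete variables, we typically look at the frequency of each level or a cross-tabulation of two variables, and draw bar charts;
for continuous variables, we look at the mean, standard deviation, quantiles, and so on.
In addition, functions such as summary(), str(), and describe() (the last one from the psych package) produce a summary of a data frame.
These are some of the most basic methods, but they are not very flexible and their output is fixed, so at this point we need to aggregate and reshape the data ourselves.
Before turning to aggregation and reshaping, we introduce tibble, a newer object that can replace the data frame, and readr, a package for reading datasets efficiently. At the end we introduce two packages for reshaping data: reshape2 and tidyr.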
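
The basic summaries mentioned at the start of this section can be sketched in base R as follows. The toy data frame and its column names are made up for illustration and are not the SegData set used later.

```r
# Hypothetical toy data: two discrete variables and one continuous variable
toy <- data.frame(
  gender = c("F", "M", "F", "F", "M"),
  house  = c("Yes", "No", "Yes", "No", "Yes"),
  income = c(50, 60, 55, 70, 65)
)

# Discrete variables: frequency per level and a two-way cross-tabulation
freq  <- table(toy$gender)
cross <- table(toy$gender, toy$house)

# Continuous variable: mean, standard deviation, and quartiles
avg  <- mean(toy$income)
sdev <- sd(toy$income)
q    <- quantile(toy$income, c(0.25, 0.5, 0.75))

# Whole-data-frame overviews
summary(toy)
str(toy)
```

The frequency table can be passed to barplot() to draw the bar chart mentioned above.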

The tibble object as a replacement for the traditional data frame

> library(tibble)
> library(ggplot2)
> sim.dat=read.csv("https://raw.githubusercontent.com/happyrabbit/DataScientistR/master/Data/SegData.csv")
> df=data.frame(x=c(1:5),y=rep("a",5))
> as_tibble(df)
# A tibble: 5 x 2
      x      y
  <int> <fctr>
1     1      a
2     2      a
3     3      a
4     4      a
5     5      a
> tibble(x=1:5,y=rep("a",5))
# A tibble: 5 x 2
      x     y
  <int> <chr>
1     1     a
2     2     a
3     3     a
4     4     a
5     5     a
> 
> tibble(x=1:5,y=1,z=x^2+y)
# A tibble: 5 x 3
      x     y     z
  <int> <dbl> <dbl>
1     1     1     2
2     2     1     5
3     3     1    10
4     4     1    17
5     5     1    26
> tb=tibble(':)'="smile",' '="space",'2000'="number")
> print(tb)
# A tibble: 1 x 3
   `:)`   ` ` `2000`
  <chr> <chr>  <chr>
1 smile space number
> 

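Note that if you use such non-syntactic variable names from a tibble in other packages, you also need to wrap them in backticks.
A tibble differs from a traditional data frame mainly in two respects: printing and subsetting.
1. Printing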

> print(as_tibble(sim.dat))
# A tibble: 1,000 x 19
     age gender   income  house store_exp online_exp store_trans online_trans
   <int> <fctr>    <dbl> <fctr>     <dbl>      <dbl>       <int>        <int>
 1    57 Female 120963.4    Yes  529.1344   303.5125           2            2
 2    63 Female 122008.1    Yes  478.0058   109.5297           4            2
 3    59   Male 114202.3    Yes  490.8107   279.2496           7            2
 4    60   Male 113616.3    Yes  347.8090   141.6698          10            2
 5    51   Male 124252.6    Yes  379.6259   112.2372           4            4
 6    59   Male 107661.5    Yes  338.3154   195.6870           4            5
 7    57   Male 120483.3    Yes  482.5445   284.5363           5            3
 8    57   Male 110542.0    Yes  340.7368   135.2556          11            5
 9    61 Female 132060.5    Yes  608.2310   142.5503           6            1
10    60   Male 105048.8    Yes  470.3190   163.4663          12            1
# ... with 990 more rows, and 11 more variables: Q1 <int>, Q2 <int>, Q3 <int>,
#   Q4 <int>, Q5 <int>, Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>,
#   segment <fctr>

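As shown above, only the first 10 rows are printed, the number of columns shown adapts to the screen width, and each column's type is displayed under its name, which is friendlier.
2. Subsetting
To extract a single variable from a tibble, use the "$" and "[[" operators:
"[[" can extract by variable name or by position;
"$" can extract by variable name only;
"%>%" (the pipe operator) can also be used when extracting.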

sim.dat$age
sim.dat[["age"]]
sim.dat[[1]]
library(dplyr)
sim.dat%>%.$age
sim.dat%>%.[["age"]]

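With a traditional data frame, extracting a single column with "[" (for example df[,"age"]) returns a vector rather than a data frame, which can cause errors in code that expects a data frame; with a tibble, "[" always returns another tibble. The "$" and "[[" operators return a vector for both kinds of object.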
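
A small sketch of how single-column subsetting behaves for each kind of object. The data frame here is made up for illustration.

```r
library(tibble)

# Hypothetical toy data
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

# "[" with a single column: a data frame drops to a vector, a tibble does not
v_df <- df[, "x"]
v_tb <- tb[, "x"]

is.data.frame(v_df)  # FALSE: a plain integer vector
is_tibble(v_tb)      # TRUE: still a one-column tibble

# "[[" and "$" return the underlying vector in both cases
tb[["x"]]
tb$x
```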
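Note: because the tibble object is relatively new, some modeling functions may not handle it well. If you want to fit models after cleaning the data, you can convert the tibble back to a plain data frame: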

sim.dat=as.data.frame(sim.dat)
class(sim.dat)

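Efficient data reading and writing: the readr package
The readr functions for reading data are:
read_csv() reads comma-separated files
read_csv2() reads semicolon-separated files
read_tsv() reads tab-separated files
read_delim() reads files with an arbitrary delimiter
Of these, read_csv() covers most data-reading needs.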

> library(readr)
# skip=2 skips the first two lines
> dat=read_csv("This line is sample data
+ This line is just a comment
+ x,y,z
+ 1,2,3",skip=2)
> print(dat)
# A tibble: 1 x 3
      x     y     z
  <int> <int> <int>
1     1     2     3

> dat=read_csv("1,2,3\n4,5,6",col_names=FALSE)
> print(dat)
# A tibble: 2 x 3
     X1    X2    X3
  <int> <int> <int>
1     1     2     3
2     4     5     6

For semicolon-separated files, use read_csv2()

> dat=read_csv2("x;y;z\n1;2;3")

> print(dat)
# A tibble: 1 x 3
      x     y     z
  <int> <int> <int>
1     1     2     3

For tab-separated files, use read_tsv()

> dat1=read_tsv("x\ty\tz\n1\t2\t3")
> print(dat1)
# A tibble: 1 x 3
      x     y     z
  <int> <int> <int>
1     1     2     3

For files with an arbitrary delimiter, use read_delim()

> dat2=read_delim("x|y|z\n1|2|3",delim=
+                     "|")
> print(dat2)
# A tibble: 1 x 3
      x     y     z
  <int> <int> <int>
1     1     2     3
> 

Specifying missing values

> dat=read_csv("x,y,z\n1,2,99",na="99")
> print(dat)
# A tibble: 1 x 3
      x     y     z
  <int> <int> <chr>
1     1     2  <NA>
> 

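readr also provides two functions for writing data, write_csv() and write_tsv(). Their advantages are:
1. Strings are always encoded as UTF-8
2. Dates and times are saved in ISO 8601 format, so other software can parse them easily
You can also use write_excel_csv(), which writes a .csv file with a UTF-8 byte order mark so that Excel reads the encoding correctly.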
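
As a rough illustration of the writing functions, the sketch below writes a small made-up data frame to a temporary file and reads it back. The data, the file path, and the column type string "icD" (integer, character, Date) are assumptions chosen for this example.

```r
library(readr)

# Hypothetical toy data with a character column and a Date column
out <- data.frame(
  id   = 1:2,
  name = c("Ann", "Bob"),
  day  = as.Date(c("2018-01-01", "2018-02-15")),
  stringsAsFactors = FALSE
)

# Write to a temporary .csv file
path <- tempfile(fileext = ".csv")
write_csv(out, path)

# Inspect the raw text: dates are stored as ISO 8601 strings
raw_lines <- readLines(path)
print(raw_lines)

# Read the file back, stating the column types explicitly
back <- read_csv(path, col_types = "icD")
print(back)
```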
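For other types of data, the following packages are available:
haven: reads SPSS, Stata, and SAS data
readxl: reads Excel files (.xls and .xlsx)
DBI: together with a backend for the specific database (MySQL, etc.), reads data directly from the database via SQL.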
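The data.table object:
We can use square brackets to index and search a data frame.
Simple aggregation can also be done with functions such as tapply(), aggregate(), and table().
Square brackets make it easy to subset a data frame, but further aggregation needs help from other packages; it would be much more convenient if aggregation could be done inside the brackets. The data.table package does exactly that:
1. It handles large datasets more efficiently
2. It is as simple to work with as a data frame
3. It subsets, groups, and merges data quickly
4. A data frame can easily be converted into a data table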
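
To make the comparison concrete, here is a minimal sketch that computes the same group mean with base R and with data.table. The toy data and the names toy, trans, and avg are made up for illustration.

```r
library(data.table)

# Hypothetical toy data
toy <- data.frame(
  gender = c("F", "M", "F", "M"),
  trans  = c(10, 4, 20, 6)
)

# Base R: group means with tapply() and aggregate()
by_tapply <- tapply(toy$trans, toy$gender, mean)
by_agg    <- aggregate(trans ~ gender, data = toy, FUN = mean)

# data.table: convert the data frame, then aggregate inside the brackets
toy_dt <- as.data.table(toy)
by_dt  <- toy_dt[, .(avg = mean(trans)), by = gender]

print(by_tapply)
print(by_agg)
print(by_dt)
```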

> library(data.table)
> dt=data.table(sim.dat)
# Note: a traditional data frame cannot do this
> dt[,mean(online_trans)]
[1] 13.546
> dt[,mean(online_trans),by=gender]
   gender       V1
1: Female 15.38448
2:   Male 11.26233
> dt[,mean(online_trans),by=.(gender,house)]
   gender house        V1
1: Female   Yes 11.312030
2:   Male   Yes  8.771523
3: Female    No 19.145833
4:   Male    No 16.486111
> dt[,.(avg=mean(online_trans)),by=.(gender,house)]
   gender house       avg
1: Female   Yes 11.312030
2:   Male   Yes  8.771523
3: Female    No 19.145833
4:   Male    No 16.486111

Data table operations are similar to SQL.
For example: select gender,avg(online_trans) from sim.dat group by gender
is equivalent to

> dt[,mean(online_trans),by=gender]
   gender       V1
1: Female 15.38448
2:   Male 11.26233
> 
select gender,house,avg(online_trans) as avg from sim.dat group by gender,house

is equivalent to

> dt[,.(avg=mean(online_trans)),by=.(gender,house)]
   gender house       avg
1: Female   Yes 11.312030
2:   Male   Yes  8.771523
3: Female    No 19.145833
4:   Male    No 16.486111
> 
select gender,house,avg(online_trans) as avg from
sim.dat where age <40 group by gender,house
is equivalent to
> dt[age<40,.(avg=mean(online_trans)),by=.(gender,house)]
   gender house      avg
1:   Male   Yes 14.45977
2: Female   Yes 18.14062
3:   Male    No 18.24299
4: Female    No 20.10196

Selecting rows

> dt[age<20&income>80000]
   age gender   income house store_exp online_exp store_trans online_trans Q1 Q2
1:  19 Female 83534.70    No  227.6686   1490.719           1           22  2  1
2:  18 Female 89415.97   Yes  209.5487   1926.470           3           28  2  1
3:  19 Female 92812.81    No  186.7475   1041.539           2           18  3  1
   Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 segment
1:  1  2  4  1  4  2  4   1   Style
2:  1  1  4  1  4  2  4   1   Style
3:  1  2  4  1  4  3  4   1   Style
> dt[1:2]
   age gender   income house store_exp online_exp store_trans online_trans Q1 Q2
1:  57 Female 120963.4   Yes  529.1344   303.5125           2            2  4  2
2:  63 Female 122008.1   Yes  478.0058   109.5297           4            2  4  1
   Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 segment
1:  1  2  1  4  1  4  2   4   Price
2:  1  2  1  4  1  4  1   4   Price
> 

Selecting columns:

> ans=dt[,age]
> head(ans)
[1] 57 63 59 60 51 59
> ans=dt[,.(age,online_exp)]
> head(ans)
   age online_exp
1:  57   303.5125
2:  63   109.5297
3:  59   279.2496
4:  60   141.6698
5:  51   112.2372
6:  59   195.6870
> ans=dt[,age:income,with=FALSE]
> head(ans,2)
   age gender   income
1:  57 Female 120963.4
2:  63 Female 122008.1
# Drop columns; the - can be replaced with !
> ans=dt[,-(age:online_exp),with=FALSE]

Tabulation

> dt[,.N]
[1] 1000
> dt[,.N,by=gender]
   gender   N
1: Female 554
2:   Male 446
> dt[age<30,.(count=.N),by=gender]
   gender count
1: Female   292
2:   Male    86
> head(dt[order(-online_exp)],5)
   age gender   income house store_exp online_exp store_trans online_trans Q1 Q2
1:  40 Female 217599.7    No  7023.684   9479.442          10            6  1  4
2:  41 Female       NA   Yes  3786.740   8638.239          14           10  1  4
3:  36   Male 228550.1   Yes  3279.621   8220.555           8           12  1  4
4:  31 Female 159508.1   Yes  5177.081   8005.932          11           13  1  4
5:  43 Female 190407.4   Yes  4694.922   7875.562           6           11  1  4
   Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10     segment
1:  5  4  3  4  4  1  4   2 Conspicuous
2:  4  4  4  4  4  1  4   2 Conspicuous
3:  5  4  4  4  4  1  4   1 Conspicuous
4:  4  4  4  4  4  1  4   2 Conspicuous
5:  5  4  4  4  4  1  4   2 Conspicuous
> dt[order(-online_exp)][1:5]
   age gender   income house store_exp online_exp store_trans online_trans Q1 Q2
1:  40 Female 217599.7    No  7023.684   9479.442          10            6  1  4
2:  41 Female       NA   Yes  3786.740   8638.239          14           10  1  4
3:  36   Male 228550.1   Yes  3279.621   8220.555           8           12  1  4
4:  31 Female 159508.1   Yes  5177.081   8005.932          11           13  1  4
5:  43 Female 190407.4   Yes  4694.922   7875.562           6           11  1  4
   Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10     segment
1:  5  4  3  4  4  1  4   2 Conspicuous
2:  4  4  4  4  4  1  4   2 Conspicuous
3:  5  4  4  4  4  1  4   1 Conspicuous
4:  4  4  4  4  4  1  4   2 Conspicuous
5:  5  4  4  4  4  1  4   2 Conspicuous
> dt[order(gender,-online_exp)][1:5]
   age gender   income house store_exp online_exp store_trans online_trans Q1 Q2
1:  40 Female 217599.7    No  7023.684   9479.442          10            6  1  4
2:  41 Female       NA   Yes  3786.740   8638.239          14           10  1  4
3:  31 Female 159508.1   Yes  5177.081   8005.932          11           13  1  4
4:  43 Female 190407.4   Yes  4694.922   7875.562           6           11  1  4
5:  50 Female 263858.0   Yes  5813.802   7448.729          11           11  1  4
   Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10     segment
1:  5  4  3  4  4  1  4   2 Conspicuous
2:  4  4  4  4  4  1  4   2 Conspicuous
3:  4  4  4  4  4  1  4   2 Conspicuous
4:  5  4  4  4  4  1  4   2 Conspicuous
5:  5  4  4  4  4  1  4   1 Conspicuous
> 

Reading data with fread()
The fread() function in data.table is generally even faster than read_csv() on large files.
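
A minimal sketch of fread() usage, assuming a small file written to a temporary path just for this example:

```r
library(data.table)

# Write a tiny comma-separated file to a temporary location (made-up content)
path <- tempfile(fileext = ".csv")
writeLines(c("x,y,z", "1,2,a", "4,5,b"), path)

# fread() detects the separator and column types automatically
dt_small <- fread(path)
print(dt_small)
```

The result is a data.table, so the bracket syntax shown earlier applies to it directly.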

Data aggregation

base package: apply(), lapply(), sapply(), etc.

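> # Keep only the non-factor columns; sapply() returns a logical vector
> sdat=sim.dat[,!sapply(sim.dat,is.factor)]
> # apply() would coerce the data frame to a character matrix, so inspect classes with sapply()
> sapply(sim.dat,class)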
> apply(sdat,MARGIN=2,function(x) mean(na.omit(x)))
> apply(sdat,MARGIN=2,function(x) sd(na.omit(x)))
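
The three base functions differ mainly in their input and return types. The following sketch uses a made-up matrix and list to show this:

```r
# apply() works over the rows (MARGIN = 1) or columns (MARGIN = 2) of a matrix
m <- matrix(1:6, nrow = 2)
row_sums  <- apply(m, 1, sum)   # one value per row
col_means <- apply(m, 2, mean)  # one value per column

# lapply() and sapply() work over the elements of a list (or the columns of a data frame)
lst <- list(a = 1:3, b = c(2, NA, 6))
l_res <- lapply(lst, mean, na.rm = TRUE)  # always returns a list
s_res <- sapply(lst, mean, na.rm = TRUE)  # simplifies to a named vector when possible

print(row_sums)
print(col_means)
print(l_res)
print(s_res)
```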

plyr package: ddply()

> library(plyr)
# The result is returned as a data frame
> ddply(sim.dat,"segment",summarize,avg_online=round(sum(online_exp)/sum(online_trans),2),avg_store=round(sum(store_exp)/sum(store_trans),2))
      segment avg_online avg_store
1 Conspicuous     442.27    479.25
2       Price      69.28     81.30
3     Quality     126.05    105.12
4       Style      92.83    121.07
> 

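dplyr package (designed specifically for working with data frames). Its main capabilities:
1. Displaying data frames
2. Subsetting data
3. Summarizing data
4. Creating new variables
5. Merging datasets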

> dplyr::tbl_df(sim.dat)
# A tibble: 1,000 x 19
     age gender   income house store_exp online_exp store_trans online_trans
   <int>  <chr>    <dbl> <chr>     <dbl>      <dbl>       <int>        <int>
 1    57 Female 120963.4   Yes  529.1344   303.5125           2            2
 2    63 Female 122008.1   Yes  478.0058   109.5297           4            2
 3    59   Male 114202.3   Yes  490.8107   279.2496           7            2
 4    60   Male 113616.3   Yes  347.8090   141.6698          10            2
 5    51   Male 124252.6   Yes  379.6259   112.2372           4            4
 6    59   Male 107661.5   Yes  338.3154   195.6870           4            5
 7    57   Male 120483.3   Yes  482.5445   284.5363           5            3
 8    57   Male 110542.0   Yes  340.7368   135.2556          11            5
 9    61 Female 132060.5   Yes  608.2310   142.5503           6            1
10    60   Male 105048.8   Yes  470.3190   163.4663          12            1
# ... with 990 more rows, and 11 more variables: Q1 <int>, Q2 <int>, Q3 <int>,
#   Q4 <int>, Q5 <int>, Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>,
#   segment <chr>
> dplyr::glimpse(sim.dat)
Observations: 1,000
Variables: 19
$ age          <int> 57, 63, 59, 60, 51, 59, 57, 57, 61, 60, 58, 59, 64, 57,...
$ gender       <chr> "Female", "Female", "Male", "Male", "Male", "Male", "Ma...
$ income       <dbl> 120963.4, 122008.1, 114202.3, 113616.3, 124252.6, 10766...
$ house        <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
$ store_exp    <dbl> 529.1344, 478.0058, 490.8107, 347.8090, 379.6259, 338.3...
$ online_exp   <dbl> 303.5125, 109.5297, 279.2496, 141.6698, 112.2372, 195.6...
$ store_trans  <int> 2, 4, 7, 10, 4, 4, 5, 11, 6, 12, 5, 6, 7, 7, 5, 5, 5, 5...
$ online_trans <int> 2, 2, 2, 2, 4, 5, 3, 5, 1, 1, 4, 2, 4, 3, 5, 1, 3, 2, 2...
$ Q1           <int> 4, 4, 5, 5, 4, 4, 4, 5, 4, 4, 4, 4, 5, 4, 4, 5, 5, 5, 4...
$ Q2           <int> 2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 2, 2...
$ Q3           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ Q4           <int> 2, 2, 2, 3, 3, 2, 2, 3, 2, 3, 3, 3, 3, 3, 2, 2, 2, 3, 3...
$ Q5           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ Q6           <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
$ Q7           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ Q8           <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
$ Q9           <int> 2, 1, 1, 2, 2, 1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2, 2...
$ Q10          <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
$ segment      <chr> "Price", "Price", "Price", "Price", "Price", "Price", "...
Subsetting data (by row/column)

> library(magrittr)
> library(dplyr)
> dplyr::filter(sim.dat,income>300000) %>%
+ dplyr::tbl_df()
# A tibble: 4 x 19
    age gender   income house store_exp online_exp store_trans online_trans
  <int>  <chr>    <dbl> <chr>     <dbl>      <dbl>       <int>        <int>
1    40   Male 301398.0   Yes  4840.461   3618.212          10           11
2    33   Male 319704.3   Yes  5998.305   4395.923           9           11
3    41   Male 317476.2   Yes  3029.844   4179.671          11           12
4    37 Female 315697.2   Yes  6548.970   4284.065          13           11
# ... with 11 more variables: Q1 <int>, Q2 <int>, Q3 <int>, Q4 <int>, Q5 <int>,
#   Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>, segment <chr>

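In addition, distinct() removes duplicate rows from a data frame; sample_frac() randomly selects a given fraction of the rows; sample_n() randomly selects a given number of rows; slice() selects rows by position; and top_n() selects the observations with the highest values of a variable.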

> dplyr::distinct(sim.dat)
# A tibble: 1,000 x 19
     age gender   income house store_exp online_exp store_trans online_trans
   <int>  <chr>    <dbl> <chr>     <dbl>      <dbl>       <int>        <int>
 1    57 Female 120963.4   Yes  529.1344   303.5125           2            2
 2    63 Female 122008.1   Yes  478.0058   109.5297           4            2
 3    59   Male 114202.3   Yes  490.8107   279.2496           7            2
 4    60   Male 113616.3   Yes  347.8090   141.6698          10            2
 5    51   Male 124252.6   Yes  379.6259   112.2372           4            4
 6    59   Male 107661.5   Yes  338.3154   195.6870           4            5
 7    57   Male 120483.3   Yes  482.5445   284.5363           5            3
 8    57   Male 110542.0   Yes  340.7368   135.2556          11            5
 9    61 Female 132060.5   Yes  608.2310   142.5503           6            1
10    60   Male 105048.8   Yes  470.3190   163.4663          12            1
# ... with 990 more rows, and 11 more variables: Q1 <int>, Q2 <int>, Q3 <int>,
#   Q4 <int>, Q5 <int>, Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>,
#   segment <chr>
> dplyr::sample_frac(sim.dat,0.05,replace=TRUE)
# A tibble: 50 x 19
     age gender    income house store_exp online_exp store_trans online_trans
   <int>  <chr>     <dbl> <chr>     <dbl>      <dbl>       <int>        <int>
 1    22   Male  91553.21    No  200.7210  1777.4974           4           27
 2    34 Female  60521.76    No  299.3096  2054.1732           3           16
 3    33   Male        NA    No  265.6550  1892.5581           2           12
 4    38 Female 164506.62   Yes 3916.9309  5764.1235          11           10
 5    26 Female  89461.40    No  200.4784  2449.7965           1           23
 6    26 Female 105528.79   Yes  186.9383  2349.9275           5           17
 7    55   Male 128194.20   Yes  595.6952   156.9314           6            2
 8    35 Female 130108.64   Yes 6155.4803  6201.7090           9           13
 9    36   Male        NA   Yes  203.3036  2202.5147           2           15
10    38   Male 267564.87   Yes 5335.1143  6052.4377           8           10
# ... with 40 more rows, and 11 more variables: Q1 <int>, Q2 <int>, Q3 <int>,
#   Q4 <int>, Q5 <int>, Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>,
#   segment <chr>
> dplyr::sample_n(sim.dat,10,replace=TRUE)
# A tibble: 10 x 19
     age gender    income house store_exp online_exp store_trans online_trans
   <int>  <chr>     <dbl> <chr>     <dbl>      <dbl>       <int>        <int>
 1    34 Female  73234.49    No  349.5491  2081.4476           4           21
 2    25 Female  90856.12    No  203.7759  2228.4818           4           23
 3    37   Male 187062.94   Yes 5931.7494  1942.1789          18           11
 4    34   Male  53945.69   Yes  370.5065  2305.3430           3           14
 5    23 Female  81763.92    No  205.6662  1040.8967           3           24
 6   300   Male 208017.46   Yes 5076.8009  6053.4853          12           11
 7    56   Male        NA   Yes  419.6702   192.3719           3            1
 8    26 Female  95341.78    No  198.9729  2036.4738           3           21
 9    26   Male  78240.93    No  430.2481  2091.4694           3           14
10    27 Female  90303.46    No  198.9020  1870.3866           6           13
# ... with 11 more variables: Q1 <int>, Q2 <int>, Q3 <int>, Q4 <int>, Q5 <int>,
#   Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>, segment <chr>
> dplyr::top_n(sim.dat,2,income)
# A tibble: 2 x 19
    age gender   income house store_exp online_exp store_trans online_trans
  <int>  <chr>    <dbl> <chr>     <dbl>      <dbl>       <int>        <int>
1    33   Male 319704.3   Yes  5998.305   4395.923           9           11
2    41   Male 317476.2   Yes  3029.844   4179.671          11           12
# ... with 11 more variables: Q1 <int>, Q2 <int>, Q3 <int>, Q4 <int>, Q5 <int>,
#   Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>, segment <chr>
> 
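
Because the full data set has no duplicate rows, the effect of distinct() is hard to see in the output above, and slice() is not shown at all. A small sketch on made-up data may help:

```r
library(dplyr)

# Hypothetical toy data with one duplicated row
toy <- data.frame(
  id     = c(1, 1, 2, 3),
  income = c(10, 10, 30, 20)
)

no_dups <- distinct(toy)           # drops the repeated (1, 10) row
middle  <- slice(toy, 2:3)         # rows 2 and 3 by position
richest <- top_n(toy, 1, income)   # the row with the highest income

print(no_dups)
print(middle)
print(richest)
```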

dplyr's select() function selects column variables (code omitted here).
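
For reference, a minimal sketch of select() on a made-up data frame; the column names only mimic the style of the data set used in this article.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  age    = c(25, 40),
  gender = c("F", "M"),
  Q1     = c(1, 2),
  Q2     = c(3, 4)
)

by_name   <- select(toy, age, gender)       # pick columns by name
by_prefix <- select(toy, starts_with("Q"))  # pick columns whose names start with "Q"
dropped   <- select(toy, -gender)           # drop a column

print(by_name)
print(by_prefix)
print(dropped)
```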
Summarizing data (the operation is similar to apply() and ddply()):

> dplyr::summarise(sim.dat,avg_online=mean(online_trans))
# A tibble: 1 x 1
  avg_online
       <dbl>
1     13.546

The group_by() function can be used to summarize observations by the groups of a categorical variable.
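
A minimal sketch of a grouped summary, using made-up data rather than sim.dat:

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  gender       = c("F", "M", "F", "M"),
  online_trans = c(10, 4, 20, 6)
)

# One summary row per level of gender
res <- toy %>%
  group_by(gender) %>%
  summarise(avg_online = mean(online_trans), n = n())

print(res)
```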

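Creating new variables
mutate() computes new columns and appends them to the data
transmute() is similar to mutate(), but keeps only the newly created columns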

> dplyr::mutate(sim.dat,total_exp=store_exp+online_exp)
# A tibble: 1,000 x 20
     age gender   income house store_exp online_exp store_trans online_trans
   <int>  <chr>    <dbl> <chr>     <dbl>      <dbl>       <int>        <int>
 1    57 Female 120963.4   Yes  529.1344   303.5125           2            2
 2    63 Female 122008.1   Yes  478.0058   109.5297           4            2
 3    59   Male 114202.3   Yes  490.8107   279.2496           7            2
 4    60   Male 113616.3   Yes  347.8090   141.6698          10            2
 5    51   Male 124252.6   Yes  379.6259   112.2372           4            4
 6    59   Male 107661.5   Yes  338.3154   195.6870           4            5
 7    57   Male 120483.3   Yes  482.5445   284.5363           5            3
 8    57   Male 110542.0   Yes  340.7368   135.2556          11            5
 9    61 Female 132060.5   Yes  608.2310   142.5503           6            1
10    60   Male 105048.8   Yes  470.3190   163.4663          12            1
# ... with 990 more rows, and 12 more variables: Q1 <int>, Q2 <int>, Q3 <int>,
#   Q4 <int>, Q5 <int>, Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>,
#   segment <chr>, total_exp <dbl>
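
For comparison, the sketch below runs mutate() and transmute() on the same made-up data, so the difference in the returned columns is visible:

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  store_exp  = c(100, 200),
  online_exp = c(50, 25)
)

with_new <- mutate(toy, total_exp = store_exp + online_exp)     # original columns plus total_exp
only_new <- transmute(toy, total_exp = store_exp + online_exp)  # total_exp only

print(with_new)
print(only_new)
```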

Merging datasets

> x=data.frame(cbind(ID=c("A","B","C"),x1=c(1,2,3)))
> y=data.frame(cbind(ID=c("B","C","D"),y1=c(T,T,F)))
> x
  ID x1
1  A  1
2  B  2
3  C  3
> y
  ID    y1
1  B  TRUE
2  C  TRUE
3  D FALSE
> left_join(x,y,by="ID")
  ID x1   y1
1  A  1 <NA>
2  B  2 TRUE
3  C  3 TRUE
Warning message:
Column `ID` joining factors with different levels, coercing to character vector 
> inner_join(x,y,by="ID")
  ID x1   y1
1  B  2 TRUE
2  C  3 TRUE
Warning message:
Column `ID` joining factors with different levels, coercing to character vector 
> full_join(x,y,by="ID")
  ID   x1    y1
1  A    1  <NA>
2  B    2  TRUE
3  C    3  TRUE
4  D <NA> FALSE
Warning message:
Column `ID` joining factors with different levels, coercing to character vector 
> semi_join(x,y,by="ID")
  ID x1
1  B  2
2  C  3
Warning message:
Column `ID` joining factors with different levels, coercing to character vector 
> anti_join(x,y,by="ID")
  ID x1
1  A  1
Warning message:
Column `ID` joining factors with different levels, coercing to character vector 
> 

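In addition, the dplyr package provides set operations on data frames, namely intersection, union, and difference (intersect(), union(), setdiff()), as well as functions that append one data frame to another by row or by column (bind_rows(), bind_cols()).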
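
A short sketch of these functions on two made-up single-column data frames:

```r
library(dplyr)

# Hypothetical toy data: two data frames with the same column
a <- data.frame(ID = c("A", "B", "C"), stringsAsFactors = FALSE)
b <- data.frame(ID = c("B", "C", "D"), stringsAsFactors = FALSE)

both    <- intersect(a, b)  # rows present in both a and b
either  <- union(a, b)      # rows present in a or b, without duplicates
only_a  <- setdiff(a, b)    # rows in a that are not in b

stacked <- bind_rows(a, b)                       # append by row
widened <- bind_cols(a, data.frame(x1 = 1:3))    # append by column

print(both)
print(either)
print(only_a)
print(stacked)
print(widened)
```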

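Data reshaping

reshape2 package
The data is first "melted" into long format with melt() and then recast into the desired shape with dcast().
melt() can melt data frames, lists, matrices, tables, and so on.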
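
The melt-then-cast workflow can be sketched as follows. The data and the names channel and amount are made up for illustration.

```r
library(reshape2)

# Hypothetical wide data: one row per id, one column per spending channel
wide <- data.frame(
  id         = c(1, 2),
  store_exp  = c(100, 200),
  online_exp = c(50, 25)
)

# melt(): stack the two expense columns into (channel, amount) pairs
long <- melt(wide, id.vars = "id",
             variable.name = "channel", value.name = "amount")

# dcast(): cast the long data back to one column per channel
back <- dcast(long, id ~ channel, value.var = "amount")

print(long)
print(back)
```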

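tidyr package
First there is gather(), which is similar to melt().
spread() is the inverse of gather(): gather() stacks several columns into one, while spread() spreads a single column out into several.
separate() and unite() are another complementary pair in tidyr: separate() splits one column into several, and unite() combines several columns into one, much like paste().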
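
A minimal sketch of the four tidyr functions on made-up data. The column names channel, amount, where, and what are assumptions chosen for this example.

```r
library(tidyr)

# Hypothetical wide data
wide <- data.frame(
  id         = c(1, 2),
  store_exp  = c(100, 200),
  online_exp = c(50, 25)
)

# gather(): stack store_exp and online_exp into key/value pairs
long <- gather(wide, key = "channel", value = "amount", store_exp, online_exp)

# spread(): the inverse, one column per distinct value of channel
back <- spread(long, key = channel, value = amount)

# separate(): split "store_exp" into "store" and "exp"
parts <- separate(long, channel, into = c("where", "what"), sep = "_")

# unite(): paste the two pieces back together
joined <- unite(parts, "channel", where, what, sep = "_")

print(long)
print(back)
print(parts)
print(joined)
```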
