Original blog post: https://suzan.rbind.io/2018/01/dplyr-tutorial-1/ Author: Suzan Baert
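Note: all code is presented as part of a pipe, even though none of the snippets is a full pipeline. In some cases I have added a glimpse() statement so that you can see the columns selected in the output tibble without printing all the data every time.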
The dataset
library(tidyverse)

#msleep dataset, bundled with ggplot2 (loaded with tidyverse)
glimpse(msleep)
## Observations: 83
## Variables: 11
## $ name         <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Grea...
## $ genus        <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bo...
## $ vore         <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi...
## $ order        <chr> "Carnivora", "Primates", "Rodentia", "Soricomorph...
## $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", N...
## $ sleep_total  <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1...
## $ sleep_rem    <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0....
## $ sleep_cycle  <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.38...
## $ awake        <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9,...
## $ brainwt      <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0....
## $ bodywt       <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.4...
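Selecting columns
Selecting columns: the basics
To select a few columns, just add their names to the select() statement. The order in which you add them determines the order in which they appear in the output.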
msleep %>%
  select(name, genus, sleep_total, awake) %>%
  glimpse()
## Observations: 83
## Variables: 4
## $ name        <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Great...
## $ genus       <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos...
## $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1,...
## $ awake       <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, ...
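If you want to add a lot of columns, you can save some typing by selecting chunks of columns, by deselecting columns, or even by deselecting a chunk and re-adding one of its columns straight afterwards.
To select a chunk of columns, use the start_col:end_col syntax: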
msleep %>%
  select(name:order, sleep_total:sleep_cycle) %>%
  glimpse
## Observations: 83
## Variables: 7
## $ name        <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Great...
## $ genus       <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos...
## $ vore        <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi"...
## $ order       <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha...
## $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1,...
## $ sleep_rem   <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6...
## $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.383...
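An alternative is to deselect columns by putting a minus sign in front of the column name. You can deselect whole chunks of columns the same way.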
msleep %>%
  select(-conservation, -(sleep_total:awake)) %>%
  glimpse
## Observations: 83
## Variables: 6
## $ name    <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater s...
## $ genus   <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos", "...
## $ vore    <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi", "c...
## $ order   <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha", "...
## $ brainwt <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000...
## $ bodywt  <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0...
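You can even deselect a whole chunk of columns and then re-add one of them. The example below deselects every column from name to awake, but re-adds the conservation column even though it is part of the deselected chunk. This only works when both steps happen within the same select() statement.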
msleep %>%
  select(-(name:awake), conservation) %>%
  glimpse
## Observations: 83
## Variables: 3
## $ brainwt      <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0....
## $ bodywt       <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.4...
## $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", N...
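Selecting columns based on partial column names
If you have many columns with similarly structured names, you can use partial matching by adding starts_with(), ends_with() or contains() to your select() statement.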
msleep %>%
  select(name, starts_with("sleep")) %>%
  glimpse
## Observations: 83
## Variables: 4
## $ name        <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Great...
## $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1,...
## $ sleep_rem   <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6...
## $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.383...

msleep %>%
  select(contains("eep"), ends_with("wt")) %>%
  glimpse
## Observations: 83
## Variables: 5
## $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1,...
## $ sleep_rem   <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6...
## $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.383...
## $ brainwt     <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.0...
## $ bodywt      <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.49...
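Selecting columns based on regex
The helper functions above all match exact (literal) pieces of text. If your column names follow a pattern that is similar but not identical, you can use any regular expression inside matches(). The example below selects every column whose name contains an "o", followed by one or more other characters, and then "er".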
#selecting based on regex
msleep %>%
  select(matches("o.+er")) %>%
  glimpse
## Observations: 83
## Variables: 2
## $ order        <chr> "Carnivora", "Primates", "Rodentia", "Soricomorph...
## $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", N...
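If you are unsure what a pattern will pick up, one option is to test the regular expression against the column names before handing it to matches(). The sketch below uses base R's grepl() for that; the variables pattern and matched are illustrative names and not part of the original post, and msleep is assumed to come from ggplot2.

```r
library(ggplot2)  # provides the msleep dataset

# Illustrative variables, not from the original tutorial
pattern <- "o.+er"

# Keep the column names that contain an "o", then one or more
# characters, then "er"
matched <- names(msleep)[grepl(pattern, names(msleep))]

print(matched)
```

Note that grepl() is case-sensitive by default, whereas matches() ignores case unless told otherwise; for the all-lowercase names in msleep this makes no difference.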
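Selecting columns by pre-identified column names
There is another option that avoids retyping column names over and over: one_of(). You can define the column names up front and then refer to them inside a select() statement, either by wrapping them in one_of() or by using the !! operator.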
classification <- c("name", "genus", "vore", "order", "conservation")

msleep %>%
  select(!!classification)
## # A tibble: 83 x 5
##    name                       genus       vore  order        conservation
##    <chr>                      <chr>       <chr> <chr>        <chr>
##  1 Cheetah                    Acinonyx    carni Carnivora    lc
##  2 Owl monkey                 Aotus       omni  Primates     <NA>
##  3 Mountain beaver            Aplodontia  herbi Rodentia     nt
##  4 Greater short-tailed shrew Blarina     omni  Soricomorpha lc
##  5 Cow                        Bos         herbi Artiodactyla domesticated
##  6 Three-toed sloth           Bradypus    herbi Pilosa       <NA>
##  7 Northern fur seal          Callorhinus carni Carnivora    vu
##  8 Vesper mouse               Calomys     <NA>  Rodentia     <NA>
##  9 Dog                        Canis       carni Carnivora    domesticated
## 10 Roe deer                   Capreolus   herbi Artiodactyla lc
## # ... with 73 more rows
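In more recent releases of dplyr and tidyselect, one_of() has been superseded by all_of() and any_of(). The sketch below assumes dplyr 1.0.0 or later; the object names strict and lenient, and the column name "habitat", are made up for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the msleep dataset

classification <- c("name", "genus", "vore", "order", "conservation")

# all_of() raises an error if any listed name is missing from the data
strict <- msleep %>%
  select(all_of(classification))

# any_of() silently skips names that are not present;
# "habitat" is a made-up column name used only to show this
lenient <- msleep %>%
  select(any_of(c(classification, "habitat")))

names(strict)
names(lenient)
```

Which helper to reach for depends on whether a missing column should stop the pipeline or be ignored.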
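Selecting columns by their data type
The select_if() function lets you pass a function that returns a logical value. For example, select_if(is.character) selects all string columns. Similarly, you can use is.numeric, is.integer, is.double, is.logical or is.factor. If you have date columns, you can load the lubridate package and use is.POSIXt or is.Date.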
msleep %>%
  select_if(is.numeric) %>%
  glimpse
## Observations: 83
## Variables: 6
## $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1,...
## $ sleep_rem   <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6...
## $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.383...
## $ awake       <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, ...
## $ brainwt     <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.0...
## $ bodywt      <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.49...
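You can also select the negation, but in that case you need to add a tilde to make sure you still pass a function to select_if(). The select_all(), select_if() and select_at() functions require a function as their argument. If you need to add a negation or extra arguments, you have to wrap your function in funs() or put a tilde in front of it to turn the expression back into a function.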
msleep %>%
  select_if(~!is.numeric(.)) %>%
  glimpse
## Observations: 83
## Variables: 5
## $ name         <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Grea...
## $ genus        <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bo...
## $ vore         <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi...
## $ order        <chr> "Carnivora", "Primates", "Rodentia", "Soricomorph...
## $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", N...
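In dplyr 1.0.0 and later, the same selections can also be written with select() and the where() helper, which takes a predicate function in much the same way. This is a minimal sketch assuming a recent dplyr; the object names numeric_cols and non_numeric_cols are only for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the msleep dataset

# Keep the columns for which the predicate returns TRUE
numeric_cols <- msleep %>%
  select(where(is.numeric))

# Negation: the tilde turns the expression into a function,
# just as it does in select_if()
non_numeric_cols <- msleep %>%
  select(where(~ !is.numeric(.x)))

names(numeric_cols)
names(non_numeric_cols)
```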
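Selecting columns by logical expressions
In fact, select_if() lets you select on any logical function, not just on data type. For instance, you can select all columns with a mean above 10. To avoid errors you also have to restrict the selection to numeric columns, which you can do either beforehand for simpler syntax or within the same line. Similarly, mean > 10 is not a function by itself, so you need to add a tilde in front of it or wrap it in funs() to turn the statement into a function.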
msleep %>%
  select_if(is.numeric) %>%
  select_if(~mean(., na.rm=TRUE) > 10)
Or, more compactly:
msleep %>%
  select_if(~is.numeric(.) & mean(., na.rm=TRUE) > 10)
## # A tibble: 83 x 3
##    sleep_total awake bodywt
##    <dbl>       <dbl> <dbl>
##  1 12.1        11.9  50.0
##  2 17.0        7.00  0.480
##  3 14.4        9.60  1.35
##  4 14.9        9.10  0.0190
##  5 4.00        20.0  600
##  6 14.4        9.60  3.85
##  7 8.70        15.3  20.5
##  8 7.00        17.0  0.0450
##  9 10.1        13.9  14.0
## 10 3.00        21.0  14.8
## # ... with 73 more rows
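Another useful function to combine with select_if() is n_distinct(), which counts the number of distinct values found in a column. For example, to return the columns with fewer than 10 distinct values, pass ~n_distinct(.) < 10 inside the select_if() statement. Because n_distinct(.) < 10 is not a function, you need to put a tilde in front of it.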
msleep %>%
  select_if(~n_distinct(.) < 10)
## # A tibble: 83 x 2
##    vore  conservation
##    <chr> <chr>
##  1 carni lc
##  2 omni  <NA>
##  3 herbi nt
##  4 omni  lc
##  5 herbi domesticated
##  6 herbi <NA>
##  7 carni vu
##  8 <NA>  <NA>
##  9 carni domesticated
## 10 herbi lc
## # ... with 73 more rows
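To see why only these two columns pass the test, it can help to look at the distinct-value count of every column first. The sketch below does that with sapply(); the names distinct_counts and low_cardinality are illustrative. Keep in mind that n_distinct() counts NA as a value of its own unless na.rm = TRUE is set.

```r
library(dplyr)
library(ggplot2)  # provides the msleep dataset

# Number of distinct values per column (NA counts as a value by default)
distinct_counts <- sapply(msleep, n_distinct)
print(distinct_counts)

# Names of the columns that the ~n_distinct(.) < 10 predicate would keep
low_cardinality <- names(distinct_counts)[distinct_counts < 10]
print(low_cardinality)
```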
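Reordering columns
You can use the select() function to reorder columns, as the example below shows. The order in which you select them determines the final order.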
msleep %>%
  select(conservation, sleep_total, name) %>%
  glimpse
## Observations: 83
## Variables: 3
## $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", N...
## $ sleep_total  <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1...
## $ name         <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Grea...
If you just want to move a few columns to the front, you can add everything() afterwards, which simply appends all the remaining columns.
msleep %>%
  select(conservation, sleep_total, everything()) %>%
  glimpse
## Observations: 83
## Variables: 11
## $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", N...
## $ sleep_total  <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1...
## $ name         <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Grea...
## $ genus        <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bo...
## $ vore         <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi...
## $ order        <chr> "Carnivora", "Primates", "Rodentia", "Soricomorph...
## $ sleep_rem    <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0....
## $ sleep_cycle  <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.38...
## $ awake        <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9,...
## $ brainwt      <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0....
## $ bodywt       <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.4...
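If you are on dplyr 1.0.0 or later, relocate() is another way to move columns without mentioning the remaining ones at all. This is a short sketch under that assumption; front and after_name are illustrative object names.

```r
library(dplyr)
library(ggplot2)  # provides the msleep dataset

# By default relocate() moves the named columns to the front
front <- msleep %>%
  relocate(conservation, sleep_total)

# .after (or .before) places the column next to another one instead
after_name <- msleep %>%
  relocate(conservation, .after = name)

names(front)
names(after_name)
```

Unlike select(), relocate() never drops columns, so there is no need for everything().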
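Column names
Sometimes the column names themselves need changing:
Renaming columns
If you are using a select() statement anyway, you can rename columns directly inside the select() function.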
msleep %>%
  select(animal = name, sleep_total, extinction_threat = conservation) %>%
  glimpse
## Observations: 83
## Variables: 3
## $ animal            <chr> "Cheetah", "Owl monkey", "Mountain beaver", ...
## $ sleep_total       <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0,...
## $ extinction_threat <chr> "lc", NA, "nt", "lc", "domesticated", NA, "v...
If you want to keep all columns and therefore have no select() statement, you can rename columns by adding a rename() statement.
msleep %>%
  rename(animal = name, extinction_threat = conservation) %>%
  glimpse
## Observations: 83
## Variables: 11
## $ animal            <chr> "Cheetah", "Owl monkey", "Mountain beaver", ...
## $ genus             <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina"...
## $ vore              <chr> "carni", "omni", "herbi", "omni", "herbi", "...
## $ order             <chr> "Carnivora", "Primates", "Rodentia", "Sorico...
## $ extinction_threat <chr> "lc", NA, "nt", "lc", "domesticated", NA, "v...
## $ sleep_total       <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0,...
## $ sleep_rem         <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, N...
## $ sleep_cycle       <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667,...
## $ awake             <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, ...
## $ brainwt           <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, N...
## $ bodywt            <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850,...
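Reformatting all column names
The select_all() function lets you change all columns and takes a function as its argument. To get all column names in uppercase, you can use toupper(); similarly, tolower() gives you lowercase.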
msleep %>%
  select_all(toupper)
## # A tibble: 83 x 11
##    NAME   GENUS VORE  ORDER CONSERVATION SLEEP_TOTAL SLEEP_REM SLEEP_CYCLE
##    <chr>  <chr> <chr> <chr> <chr>        <dbl>       <dbl>     <dbl>
##  1 Cheet~ Acin~ carni Carn~ lc           12.1        NA        NA
##  2 Owl m~ Aotus omni  Prim~ <NA>         17.0        1.80      NA
##  3 Mount~ Aplo~ herbi Rode~ nt           14.4        2.40      NA
##  4 Great~ Blar~ omni  Sori~ lc           14.9        2.30      0.133
##  5 Cow    Bos   herbi Arti~ domesticated 4.00        0.700     0.667
##  6 Three~ Brad~ herbi Pilo~ <NA>         14.4        2.20      0.767
##  7 North~ Call~ carni Carn~ vu           8.70        1.40      0.383
##  8 Vespe~ Calo~ <NA>  Rode~ <NA>         7.00        NA        NA
##  9 Dog    Canis carni Carn~ domesticated 10.1        2.90      0.333
## 10 Roe d~ Capr~ herbi Arti~ lc           3.00        NA        NA
## # ... with 73 more rows, and 3 more variables: AWAKE <dbl>, BRAINWT <dbl>,
## #   BODYWT <dbl>
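In dplyr 1.0.0 and later, rename_with() covers the same ground and can also be limited to a subset of columns. The sketch below assumes such a version; upper and upper_sleep are illustrative object names.

```r
library(dplyr)
library(ggplot2)  # provides the msleep dataset

# Apply toupper() to every column name
upper <- msleep %>%
  rename_with(toupper)

# Apply it only to the columns whose name starts with "sleep"
upper_sleep <- msleep %>%
  rename_with(toupper, .cols = starts_with("sleep"))

names(upper)
names(upper_sleep)
```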
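You can go further by creating functions on the fly: if you have messy column names coming from Excel, for example, you can replace all spaces with an underscore.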
#making an unclean database:
msleep2 <- select(msleep, name, sleep_total, brainwt)
colnames(msleep2) <- c("name", "sleep total", "brain weight")

msleep2 %>%
  select_all(~str_replace(., " ", "_"))
## # A tibble: 83 x 3
##    name                       sleep_total brain_weight
##    <chr>                      <dbl>       <dbl>
##  1 Cheetah                    12.1        NA
##  2 Owl monkey                 17.0        0.0155
##  3 Mountain beaver            14.4        NA
##  4 Greater short-tailed shrew 14.9        0.000290
##  5 Cow                        4.00        0.423
##  6 Three-toed sloth           14.4        NA
##  7 Northern fur seal          8.70        NA
##  8 Vesper mouse               7.00        NA
##  9 Dog                        10.1        0.0700
## 10 Roe deer                   3.00        0.0982
## # ... with 73 more rows
Or, if your column names contain other metadata, such as question numbers:
#making an unclean database:
msleep2 <- select(msleep, name, sleep_total, brainwt)
colnames(msleep2) <- c("Q1 name", "Q2 sleep total", "Q3 brain weight")
msleep2[1:3,]
## # A tibble: 3 x 3
##   `Q1 name`       `Q2 sleep total` `Q3 brain weight`
##   <chr>           <dbl>            <dbl>
## 1 Cheetah         12.1             NA
## 2 Owl monkey      17.0             0.0155
## 3 Mountain beaver 14.4             NA
You can use select_all() in combination with str_replace() to get rid of the extra characters.
msleep2 %>%
  select_all(~str_replace(., "Q[0-9]+", "")) %>%
  select_all(~str_replace(., " ", "_"))
## # A tibble: 83 x 3
##    `_name`                    `_sleep total` `_brain weight`
##    <chr>                      <dbl>          <dbl>
##  1 Cheetah                    12.1           NA
##  2 Owl monkey                 17.0           0.0155
##  3 Mountain beaver            14.4           NA
##  4 Greater short-tailed shrew 14.9           0.000290
##  5 Cow                        4.00           0.423
##  6 Three-toed sloth           14.4           NA
##  7 Northern fur seal          8.70           NA
##  8 Vesper mouse               7.00           NA
##  9 Dog                        10.1           0.0700
## 10 Roe deer                   3.00           0.0982
## # ... with 73 more rows
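Notice that the names in the output above still start with an underscore and still contain spaces. That is because str_replace() only replaces the first match, and the first pattern leaves the space after the question number behind. One possible tidy-up, sketched below, removes the prefix together with its trailing space and then uses str_replace_all() for the remaining spaces; the object name cleaned and the adjusted pattern are my own choices, not the original author's.

```r
library(dplyr)
library(stringr)
library(ggplot2)  # provides the msleep dataset

# Recreate the messy column names from the example above
msleep2 <- select(msleep, name, sleep_total, brainwt)
colnames(msleep2) <- c("Q1 name", "Q2 sleep total", "Q3 brain weight")

cleaned <- msleep2 %>%
  # drop the question number and the single space that follows it
  select_all(~ str_replace(., "^Q[0-9]+ ", "")) %>%
  # replace every remaining space, not just the first one
  select_all(~ str_replace_all(., " ", "_"))

names(cleaned)
```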
Row names to column
Some data frames have row names that are meaningful in themselves, such as the mtcars dataset:
mtcars %>%
  head
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
If you want these row names to become an actual column, you can use the rownames_to_column() function and specify a name for the new column.
mtcars %>%
  tibble::rownames_to_column("car_model") %>%
  head
##           car_model  mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 4    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 5 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## 6           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
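If you would like a tibble as the end result, as_tibble() from the tibble package can do the conversion and keep the row names in one step through its rownames argument. This is a small sketch of that alternative; cars_tbl is an illustrative object name.

```r
library(tibble)

# Convert the data frame to a tibble and store the row names
# in a new first column called "car_model"
cars_tbl <- as_tibble(mtcars, rownames = "car_model")

head(cars_tbl)
```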