Word frequency counting with Linux commands

Original post

2020-06-11 07:18

Problem

Given the sample file test.txt below, count how often each word appears in the first column and sort the result by count.

hello marry
max thread
hello lihua
max apple
max code
nasa connection

Solution

Cut -> group -> sort, with the final sort comparing the counts numerically:

cat test.txt | cut -d ' ' -f1 | sort | uniq -c | sort -k1,1n

      1 nasa
      2 hello
      3 max
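The last stage orders the groups by their count, so the comparison should be numeric rather than textual. A minimal sketch of the difference, using two made-up count lines (`10 a` and `9 b`, which are not from test.txt) and `LC_ALL=C` so the byte-wise ordering is reproducible:

```shell
# Two hypothetical "count word" lines with unpadded counts.
lines=$(printf '%s\n' '10 a' '9 b')

# A plain key sort compares text, so "10" lands before "9".
as_text=$(printf '%s\n' "$lines" | LC_ALL=C sort -k 1)

# Restricting the key to field 1 and adding n compares the counts as numbers.
as_number=$(printf '%s\n' "$lines" | LC_ALL=C sort -k1,1n)

echo "text order:";    echo "$as_text"
echo "numeric order:"; echo "$as_number"
```

uniq -c pads its counts with leading blanks, which often hides the problem, but a numeric key does not depend on that padding.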

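Note: the sort before grouping is not redundant. Without it, uniq only groups identical words that are adjacent, so the same word can end up in several groups.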
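To see the effect of that first sort, here is a small sketch that recreates the sample file in a scratch directory and runs the grouping step with and without it. The `workdir`, `unsorted` and `sorted` names are only for illustration.

```shell
# Recreate the sample file in a scratch directory (hypothetical location).
workdir=$(mktemp -d)
cat > "$workdir/test.txt" <<'EOF'
hello marry
max thread
hello lihua
max apple
max code
nasa connection
EOF

# Without the first sort, uniq only merges adjacent duplicates,
# so "hello" and "max" each show up in more than one group.
unsorted=$(cut -d ' ' -f1 "$workdir/test.txt" | uniq -c)

# With the sort, identical words become adjacent and are counted together.
sorted=$(cut -d ' ' -f1 "$workdir/test.txt" | sort | uniq -c)

echo "without sort:"; echo "$unsorted"
echo "with sort:";    echo "$sorted"

rm -rf "$workdir"
```

On the sample data the unsorted variant reports five groups, while the sorted one reports the expected three.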

Extension

Count the number of distinct words

cat test.txt | cut -d ' ' -f1 | sort | uniq -c | wc -l
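Since only the number of lines matters here, the per-word counts from `-c` are not actually used. As a sketch of a shorter equivalent, `sort -u` removes duplicates while sorting; the scratch directory and variable names below are assumptions for the demo.

```shell
# Recreate the sample file in a scratch directory (hypothetical location).
workdir=$(mktemp -d)
printf '%s\n' 'hello marry' 'max thread' 'hello lihua' \
              'max apple' 'max code' 'nasa connection' > "$workdir/test.txt"

# Form used above: one output line per distinct word, then count the lines.
# tr strips the padding that some wc implementations add.
via_uniq=$(cut -d ' ' -f1 "$workdir/test.txt" | sort | uniq -c | wc -l | tr -d ' ')

# Shorter form: sort -u drops duplicate lines while sorting.
via_sort_u=$(cut -d ' ' -f1 "$workdir/test.txt" | sort -u | wc -l | tr -d ' ')

echo "distinct words: $via_uniq (uniq -c), $via_sort_u (sort -u)"

rm -rf "$workdir"
```

Both forms report 3 for the sample file (hello, max, nasa).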

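Large data sets

Does this word count still work on very large inputs? Not out of the box. sort uses an external merge sort and writes its temporary chunk files to /tmp by default. /tmp is often limited in size, say 4G, and sorting a file that needs more temporary space than that fails. You can point sort at a different directory for its temporary files with sort -T Path. See reference [1].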
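A minimal sketch of redirecting the temporary files, assuming a sort that supports `-T` and honours `TMPDIR` (GNU sort documents both). The `bigtmp` directory is a stand-in created inside a scratch directory; in practice it would be a path on a volume with enough free space.

```shell
# Recreate the sample file in a scratch directory (hypothetical location).
workdir=$(mktemp -d)
printf '%s\n' 'hello marry' 'max thread' 'hello lihua' \
              'max apple' 'max code' 'nasa connection' > "$workdir/test.txt"

# Stand-in for a directory on a larger volume (assumption for this demo).
bigtmp="$workdir/sort-tmp"
mkdir -p "$bigtmp"

# -T tells sort where to write its temporary files.
counts=$(cut -d ' ' -f1 "$workdir/test.txt" \
  | sort -T "$bigtmp" \
  | uniq -c \
  | sort -T "$bigtmp" -k1,1n)

# Setting TMPDIR for the sort commands is an alternative to repeating -T.
counts_env=$(cut -d ' ' -f1 "$workdir/test.txt" \
  | TMPDIR="$bigtmp" sort \
  | uniq -c \
  | TMPDIR="$bigtmp" sort -k1,1n)

echo "$counts"

rm -rf "$workdir"
```

On a file this small sort never spills to disk, so the sketch only shows where the option goes; the result is the same as without it.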

References

[1] 大數據量下的sort-linux (sort on large data volumes, Linux)
