【Leetcode Shell】Word Frequency

Problem:

Write a bash script to calculate the frequency of each word in a text file words.txt.

For simplicity sake, you may assume:

words.txt contains only lowercase characters and space ' ' characters.
Each word must consist of lowercase characters only.
Words are separated by one or more whitespace characters.

For example, assume that words.txt has the following content:

the day is sunny the the
the sunny is is

Your script should output the following, sorted by descending frequency:

the 4
is 3
sunny 2
day 1

Note:
Don't worry about handling ties, it is guaranteed that each word's frequency count is unique.

First attempt:

# Read from the file words.txt and output the word frequency list to stdout.
sed 's/ /\n/g' words.txt | sort | uniq -c | sort -r | awk '{print $2 " " $1}'

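Idea:

(1) Use sed to turn each space into a newline.
(2) Sort the result with sort.
(3) Count how many times each word occurs with uniq -c.
(4) Sort that output in reverse order with sort -r.
(5) Rearrange the columns with awk.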
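The five steps can be tried on the sample from the problem statement. This is a minimal sketch: the function name word_freq_first_attempt is made up for illustration, it reads from stdin rather than words.txt, and it assumes GNU sed (which accepts \n in the replacement text).

```shell
# Sketch of the first attempt, reading text from stdin instead of words.txt.
# Assumes GNU sed, which accepts \n in the replacement text.
word_freq_first_attempt() {
  sed 's/ /\n/g' |            # (1) one word per line
    sort |                    # (2) bring equal words together
    uniq -c |                 # (3) prefix each distinct word with its count
    sort -r |                 # (4) reverse order, so larger counts come first
    awk '{print $2 " " $1}'   # (5) print "word count" instead of "count word"
}

# Sample from the problem statement: single spaces, single-digit counts.
first_result=$(printf 'the day is sunny the the\nthe sunny is is\n' | word_freq_first_attempt)
printf '%s\n' "$first_result"
```

On this small sample the pipeline gives the expected listing, because the words are separated by single spaces and every count has one digit.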

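This produced an error.

Cause of the error:

I overlooked the effect of multiple spaces or tabs. When two words are separated by several spaces, sed replaces every one of those spaces with a newline, so the output contains empty lines, and uniq -c counts them as if they were a word.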
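The effect of a double space is easy to reproduce on a tiny input. The sketch below assumes GNU sed; the variable names split and empty_lines are only for illustration.

```shell
# Two spaces between the words: each space becomes its own newline.
# Assumes GNU sed for \n in the replacement.
split=$(printf 'the  day\n' | sed 's/ /\n/g')

# Count the empty lines that the double space left behind.
empty_lines=$(printf '%s\n' "$split" | grep -c '^$')
echo "empty lines: $empty_lines"

# The empty line is counted like any other line and shows up
# as an entry with a count but no word.
printf '%s\n' "$split" | sort | uniq -c | awk '{print $2 " " $1}'
```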

Second attempt:

# Read from the file words.txt and output the word frequency list to stdout.
sed 's/ /\n/g' words.txt | sed '/^\s*$/d' | sort | uniq -c | sort -r | awk '{print $2 " " $1}'

Since the input for this problem contains only spaces and no tabs, I added a step between (1) and (2) that deletes the blank lines.

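Still an error.

Cause of the error:

"can 13" should have been at the top of the output, but it ended up at the bottom. That points to a mistake in the sort command.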

Third attempt:

# Read from the file words.txt and output the word frequency list to stdout.
sed 's/ /\n/g' words.txt | sed '/^\s*$/d' | sort | uniq -c | sort -rn | awk '{print $2 " " $1}'

Accepted

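So the sort command really was the problem. Without -n, sort compares the lines as text, so a count such as 13 can end up below a single-digit count. Adding the -n option sorts by the occurrence count as a number (note that uniq -c prints each line as: count word), and the earlier problem goes away.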
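The difference between the two sort invocations can be seen on a few hand-written lines. This is only a sketch: the lines are unpadded "count word" pairs chosen to make the text comparison visible, and LC_ALL=C is set so the byte-wise ordering is predictable. Real uniq -c output is padded, and whether plain sort -r goes wrong on it depends on that padding and on the locale.

```shell
# Unpadded "count word" lines, chosen so the text comparison is easy to see.
lines='9 the
13 can
2 day'

# Text comparison: "9 the" sorts above "13 can" because '9' > '1'.
as_text=$(printf '%s\n' "$lines" | LC_ALL=C sort -r)

# Numeric comparison: 13 > 9 > 2.
as_number=$(printf '%s\n' "$lines" | LC_ALL=C sort -rn)

printf 'sort -r:\n%s\n\nsort -rn:\n%s\n' "$as_text" "$as_number"
```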

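Key points:

1. Converting with sed

(1) Replace spaces with newlines: sed 's/ /\n/g'

(2) Delete blank lines: sed '/^\s*$/d'. Alternatives are awk NF, awk '!/^$/' (which removes only completely empty lines), or tr -s '\n' (which squeezes each run of newlines into one).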
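The blank-line filters can be compared side by side. The sketch below assumes GNU sed (for \n in the replacement and \s in the pattern). The last variant is a swapped-in shortcut rather than one of the commands listed above: tr -s ' ' '\n' splits and squeezes in a single step.

```shell
# Input with runs of two and three spaces.
input='the  day   is'

# Split on every space, then drop the blank lines in different ways.
with_sed=$(printf '%s\n' "$input" | sed 's/ /\n/g' | sed '/^\s*$/d')
with_awk=$(printf '%s\n' "$input" | sed 's/ /\n/g' | awk NF)

# tr translates spaces to newlines and squeezes repeated newlines at once.
with_tr=$(printf '%s\n' "$input" | tr -s ' ' '\n')

printf '%s\n' "$with_sed"
```

One caveat with the tr variant: if a line starts with a space, a single empty line still remains at that position, so a filter such as awk NF may still be wanted afterwards.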

2. Sorting with sort

(1) sort -r sorts in reverse order

-r, --reverse               reverse the result of comparisons
--sort=WORD                 sort according to WORD:
                              general-numeric -g, human-numeric -h, month -M,
                              numeric -n, random -R, version -V

(2) sort -n sorts by numeric value. The help text reads: "compare according to string numerical value"

3. Detecting duplicates with uniq

Running uniq --help shows the following note in the help text:

Note: 'uniq' does not detect repeated lines unless they are adjacent.
You may want to sort the input first, or use `sort -u' without `uniq'.

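As the note says, uniq only detects duplicates that are adjacent, so we run sort before uniq to bring repeated words next to each other, which lets uniq -c count them. sort -u also removes duplicates, but it does not report how often each line occurred, so it cannot replace sort | uniq -c for this problem.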
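A short experiment shows why the sort has to come first. In this sketch the trailing awk only strips the padding that uniq -c adds, and the variable names are illustrative.

```shell
# "the" appears three times, but not on adjacent lines.
words='the
is
the
the'

# Without sort, uniq -c only merges neighbouring duplicates.
unsorted=$(printf '%s\n' "$words" | uniq -c | awk '{print $1, $2}')

# With sort first, all copies of a word are adjacent and counted together.
sorted=$(printf '%s\n' "$words" | sort | uniq -c | awk '{print $1, $2}')

# sort -u keeps one copy of each word but prints no counts.
distinct=$(printf '%s\n' "$words" | sort -u)

printf 'unsorted:\n%s\n\nsorted:\n%s\n\nsort -u:\n%s\n' "$unsorted" "$sorted" "$distinct"
```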

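4. Reformatting with awk

Because the output of sort -rn has the form "count word", it needs to be rearranged: awk (whose default field separator is whitespace) prints the second column followed by the first.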
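The column swap can be checked in isolation. The input lines in this sketch are written by hand to imitate the padded "count word" layout; awk ignores the leading blanks when it splits a line into fields.

```shell
# Hand-written lines in the padded "count word" layout.
swapped=$(printf '      4 the\n      3 is\n' | awk '{print $2 " " $1}')
printf '%s\n' "$swapped"
```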

Extension:

How should the script be written if the file contains tab characters?

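# Read from the file words.txt and output the word frequency list to stdout.
sed -e 's/\t/ /g' -e 's/ /\n/g' words.txt | sed '/^\s*$/d' \
| sort | uniq -c | sort -rn | awk '{print $2 " " $1}'

Use sed to turn every tab into a space before the spaces are split into newlines. The tabs must not simply be deleted, because that would join two words that are separated only by a tab. Note that sed takes the -e option when several commands are given in one invocation.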
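As an alternative sketch, and not the approach used in this post, tr can treat spaces and tabs as separators in one step. The function name word_freq is hypothetical, and it reads from stdin; with the real file it would be called as word_freq < words.txt.

```shell
# Alternative sketch: split on spaces and tabs with tr instead of sed.
word_freq() {
  tr -s ' \t' '\n' |          # spaces and tabs become newlines, runs are squeezed
    awk NF |                  # drop an empty line left by leading whitespace
    sort |
    uniq -c |
    sort -rn |                # numeric sort, highest count first
    awk '{print $2 " " $1}'
}

# Words separated by a tab, by two spaces, and by a space-tab-space mix.
tab_result=$(printf 'the\tday  is\nthe \t is the\n' | word_freq)
printf '%s\n' "$tab_result"
```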
