Problem:
Write a bash script to calculate the frequency of each word in a text file words.txt.
For simplicity's sake, you may assume:
- words.txt contains only lowercase characters and space ' ' characters.
- Each word must consist of lowercase characters only.
- Words are separated by one or more whitespace characters.
For example, assume that words.txt has the following content:
the day is sunny the the the sunny is is
Your script should output the following, sorted by descending frequency:
the 4
is 3
sunny 2
day 1
Note:
Don't worry about handling ties, it is guaranteed that each word's frequency count is unique.
First attempt:
# Read from the file words.txt and output the word frequency list to stdout.
sed 's/ /\n/g' words.txt | sort | uniq -c | sort -r | awk '{print $2 " " $1}'
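Idea:
(1) Use sed to turn each space into a newline.
(2) Sort the result with sort.
(3) Count how many times each word occurs with uniq -c.
(4) Sort that result in reverse order with sort -r.
(5) Rearrange the columns with awk.
Result: wrong answer.
Cause:
The effect of multiple spaces (or tabs) was overlooked. If two words are separated by several spaces, sed 's/ /\n/g' turns every single space into its own newline, so the extra spaces become empty lines that are then counted as if they were words.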
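The failure can be reproduced on a tiny input. The sketch below assumes GNU sed (the `\n` in the replacement is a GNU extension), and `split_on_space` is a hypothetical helper name used only for illustration.

```shell
# Hypothetical helper: the first stage of the pipeline in isolation.
# Assumes GNU sed, where \n in the replacement means a newline.
split_on_space() {
  sed 's/ /\n/g'
}

# Two spaces between "the" and "day": each space becomes its own newline,
# so an empty line appears between the two words.
printf 'the  day\n' | split_on_space

# The empty line survives sort and is counted by uniq -c like any word.
printf 'the  day\n' | split_on_space | sort | uniq -c
```

The second command lists three entries rather than two, one of them for the empty line.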
Second attempt:
# Read from the file words.txt and output the word frequency list to stdout.
sed 's/ /\n/g' words.txt | sed '/^\s*$/d' | sort | uniq -c | sort -r | awk '{print $2 " " $1}'
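Because the input for this problem contains only spaces and no tabs, a step that deletes blank lines is added between (1) and (2).
Still a wrong answer:
Cause:
"can 13" should have come first, but it ended up last. This shows that the sort command is at fault.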
Third attempt:
# Read from the file words.txt and output the word frequency list to stdout.
sed 's/ /\n/g' words.txt | sed '/^\s*$/d' | sort | uniq -c | sort -rn | awk '{print $2 " " $1}'
Accepted
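So the sort command was indeed being used incorrectly. Adding the -n option sorts numerically by the occurrence count (note: uniq -c prints each line as "count word"), which avoids the misordering seen earlier.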
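The difference between text and numeric comparison can be seen in a small sketch. The two sample lines are made up for illustration and are unpadded, whereas real uniq -c output pads the counts with leading spaces; `LC_ALL=C` is set here only to make the text comparison deterministic, since how sort compares text depends on the locale.

```shell
# Hypothetical sample in "count word" form (unpadded for clarity).
counts() {
  printf '13 can\n4 the\n'
}

# Text comparison: "4" sorts above "1", so "4 the" comes first.
counts | LC_ALL=C sort -r

# Numeric comparison: 13 > 4, so "13 can" comes first.
counts | LC_ALL=C sort -rn
```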
Key points for this problem:
1. sed substitution
(1) Convert spaces to newlines: sed 's/ /\n/g'
(2) Delete blank lines (including lines that contain only whitespace): sed '/^\s*$/d'
You can also use awk NF, or awk '!/^$/', or tr -s '\n'.
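The alternatives listed above can be compared side by side on a small input. This is a minimal sketch: `sample` is a hypothetical helper, and the sed variant assumes GNU sed because `\s` is a GNU extension.

```shell
# Hypothetical sample: three words separated by empty lines.
sample() {
  printf 'the\n\nday\n\n\nis\n'
}

sample | sed '/^\s*$/d'   # GNU sed: drop empty or whitespace-only lines
sample | awk 'NF'         # keep lines that have at least one field
sample | awk '!/^$/'      # keep lines that are not completely empty
sample | tr -s '\n'       # squeeze runs of newlines into a single newline
```

On this input all four print the three words one per line. They are not fully equivalent in general: awk '!/^$/' keeps lines that contain only spaces, and tr -s '\n' leaves one empty line if the input starts with blank lines.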
2. sort
(1) sort -r sorts in reverse order
-r, --reverse reverse the result of comparisons
--sort=WORD sort according to WORD:
general-numeric -g, human-numeric -h, month -M,
numeric -n, random -R, version -V
(2) sort -n sorts by the numeric value at the start of each line; the help text says: "compare according to string numerical value"
3. uniq for duplicates
Running uniq --help shows the following note in the help text:
Note: 'uniq' does not detect repeated lines unless they are adjacent.
You may want to sort the input first, or use `sort -u' without `uniq'.
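As the note says, uniq only detects adjacent duplicates, so we run sort before uniq to make repeated words adjacent, which lets uniq -c count the repetitions. Note that sort -u also removes duplicates, but it does not report counts, so it cannot replace sort | uniq -c for this problem.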
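A short sketch makes the adjacency requirement concrete. `words` is a hypothetical helper that emits a tiny unsorted word list.

```shell
# Hypothetical sample: "the" appears twice, but not on adjacent lines.
words() {
  printf 'the\nis\nthe\n'
}

# Without sorting, uniq -c reports "the" twice, each with a count of 1.
words | uniq -c

# After sorting, the two "the" lines are adjacent and are counted together.
words | sort | uniq -c

# sort -u removes the duplicate but prints no counts at all.
words | sort -u
```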
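4. awk formatting
The output after sort -rn has the form "count word", so it needs to be rearranged. awk splits fields on whitespace by default, so swapping the first and second columns is enough.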
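The column swap can be tried on its own. In this sketch `swap_columns` is a hypothetical wrapper around the awk command from the script, and the input imitates the space-padded lines that uniq -c produces.

```shell
# Hypothetical wrapper: print the second field, a space, then the first.
swap_columns() {
  awk '{print $2 " " $1}'
}

# Leading padding is ignored by awk's default field splitting.
printf '      4 the\n      3 is\n' | swap_columns
```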
Extension:
How should the shell script be written if the file also contains tab characters?
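# Read from the file words.txt and output the word frequency list to stdout.
sed -e 's/\t/\n/g' -e 's/ /\n/g' words.txt | sed '/^\s*$/d' \
  | sort | uniq -c | sort -rn | awk '{print $2 " " $1}'
The tabs only need to be converted to newlines in the same way as the spaces. Deleting them instead would join two words that are separated only by a tab. Note that to run several sed commands in one invocation, each one is passed with the -e option.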
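As an alternative sketch (not the approach used above), awk's default field splitting already treats any run of spaces and tabs as a single separator, so it can emit one word per line without a separate blank-line cleanup step. The function name `word_freq` and the sample text are assumptions made for illustration.

```shell
# Hypothetical helper: print "word count" pairs for the file given as $1.
word_freq() {
  # Print every whitespace-separated field on its own line, then count.
  awk '{ for (i = 1; i <= NF; i++) print $i }' "$1" \
    | sort | uniq -c | sort -rn | awk '{ print $2 " " $1 }'
}

# Try it on a temporary file that mixes single spaces, double spaces and a tab.
tmp=$(mktemp)
printf 'the day\tis  sunny the the\nthe sunny is is\n' > "$tmp"
word_freq "$tmp"
rm -f "$tmp"
```

Because empty fields are never produced, leading, trailing, or repeated separators need no special handling.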