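A Zhihu post titled "有什麼相見恨晚的背單詞方法?" ("What vocabulary-memorization methods do you wish you had found sooner?") says that the first hurdle in learning English well is reaching a 7,000-word vocabulary, and there are plenty of stories online about "memorizing 7,000 words with 100 sentences". But after I downloaded those 100 sentences, they seemed to contain far fewer than 7,000 words, so I decided to use Python to check whether that impression was right.
Below are a few lines of Python that count how many words these 100 sentences actually contain.
I have already downloaded the 100 sentences; if you haven't, or are simply curious, you can find them here: Memorizing 7,000 words with 100 sentences...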
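A brief breakdown of how the code counts the words:
1. Read the file contents directly with the open() function.
2. The file may not be entirely in English, so the English words have to be extracted; a regular expression is used for this.
3. Collect the extracted words into a list, which makes them easy to count.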
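As a quick illustration of step 2, here is a minimal sketch of what the pattern `[A-Za-z]+` pulls out of a mixed-language line. The sample line is made up for this example and is not taken from the 100-sentence file.

```python
import re

# Hypothetical sample line mixing Chinese text, digits and English words.
sample_line = "1. 典型的 Typical of the grassland dwellers 是 the continent's 3 foxes."

# [A-Za-z]+ matches each maximal run of ASCII letters;
# Chinese characters, digits and punctuation all act as separators.
words = re.findall(r"[A-Za-z]+", sample_line)
print(words)
# ['Typical', 'of', 'the', 'grassland', 'dwellers', 'the', 'continent', 's', 'foxes']
```

Note that the apostrophe splits "continent's" into "continent" and "s", so a pattern like this can add a few stray one-letter "words" to the count.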
The code is as follows; the full code and data are also available on GitHub.
import re

# Read the text file and collect the English words of each line
word_list = []
sent_length = []
with open("blog_en_sent_100.txt") as word_data:
    for line in word_data:
        # Extract runs of ASCII letters; lines without any English word are skipped
        sentence = re.findall(r"[A-Za-z]+", line)
        if sentence:
            word_list.append(sentence)
            sent_length.append(len(sentence))
print("Maxinum and minium words length of one sentence:", max(sent_length), 'and', min(sent_length))

# Flatten the per-sentence lists into a single lowercase word list
words = [word.lower() for sent in word_list for word in sent]
print("Total words quantity: ", len(words), '\n', "Actual Words quantity: ", len(set(words)))
# print("Actual words:\n", sorted(set(words)))

# Words of three letters or fewer are treated as common words
common_words = [word for word in set(words) if len(word) <= 3]
print('Common words: ', len(common_words))
# print("Common words:\n", sorted(common_words))

# Everything that is not a common word counts as uncommon
uncommon_words = [word for word in set(words) if word not in common_words]
print("%s English words after delete common words:\n%s" % (len(uncommon_words), sorted(uncommon_words)))
Final output:
Maxinum and minium words length of sentences: 78 and 1
Total words quantity: 2493
Actual Words quantity: 1143
Common words: 91
1052 English words after delete common words:
['abandoned', 'ability', 'able', 'about', 'abstract', 'abundant', 'abundantly', 'according', 'accurate', 'acids', 'acquisitions', 'across', 'action', 'activities', 'actually', 'adding', 'advanced', ...... .......'withstand', 'woman', 'words', 'world',
'worldwide', 'worth', 'would', 'write', 'writing', 'xenon', 'years', 'yield', 'york', 'young']
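To save space, only part of the uncommon-word list is shown; if you are interested, run the code yourself to get the full list.
That settles the story: the so-called "100 sentences for 7,000 words" contain only 1,143 distinct words. Even counting the repeats across sentences, the total is only 2,493, and after removing common words (those of three letters or fewer) a mere 1,052 remain.
Hahaha... learning a bit more really does pay off; at the very least, you become harder to fool.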
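The three figures are linked by a simple identity: distinct words minus common words equals uncommon words (1143 - 91 = 1052). The sketch below wraps the same counting logic in a small helper and checks that identity on a tiny sample. The function name `vocab_stats`, its `max_common_len` parameter and the sample sentences are hypothetical and are not part of the original script.

```python
import re

def vocab_stats(lines, max_common_len=3):
    """Return (total, unique, common, uncommon) word counts for an iterable of text lines."""
    # Lowercase every run of ASCII letters found in the lines.
    words = [w.lower() for line in lines for w in re.findall(r"[A-Za-z]+", line)]
    unique = set(words)
    # Same heuristic as the script: short words count as common.
    common = {w for w in unique if len(w) <= max_common_len}
    return len(words), len(unique), len(common), len(unique - common)

# Hypothetical two-line sample, not taken from the 100-sentence file.
sample = ["The cat sat on the mat.", "The dog chased the cat away."]
total, unique, common, uncommon = vocab_stats(sample)
print(total, unique, common, uncommon)  # 12 8 6 2

# Distinct words always split into common and uncommon ones.
assert unique == common + uncommon
# The figures reported for the 100 sentences obey the same identity.
assert 1143 - 91 == 1052
```

The length cutoff is only a rough stand-in for "common": it also drops short but meaningful words and keeps long everyday ones such as "about" or "would", so the 1,052 figure is best read as an approximation.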