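[Mastering Feature Engineering] Study Notes (2)

[Mastering Feature Engineering] Study Notes: Day 2 & 2.5, Chapter 3, from page 33

3. Text Data: Flattening, Filtering, and Chunking

3.1 Bag-of-X: Turning Natural Text into Flat Vectors

3.1.1 Bag-of-Words

In the bag-of-words representation, each feature is a word, and the feature vector for a word consists of its count in each document. Equivalently, each document becomes a flat vector of word counts, with word order discarded.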
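As a small illustration of the bag-of-words idea, here is a minimal sketch using scikit-learn's CountVectorizer. The two toy sentences and the variable names are assumptions made up for this example; they are not from the book's dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents (hypothetical, for illustration only)
docs = ["it is a puppy and it is extremely cute", "it is a cat"]

# Keep single-character tokens such as "a" by overriding the default pattern
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
counts = vectorizer.fit_transform(docs).toarray()  # shape: (documents, words)
vocab = vectorizer.vocabulary_                     # maps word -> column index

# Row view: one document as a vector of word counts
print(counts[0])
# Column view: one word's counts across all documents
print("counts of 'is' per document:", counts[:, vocab["is"]])
```

Each row is a document and each column is a word, so reading a column gives the per-document counts of that word.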

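3.1.2 Bag-of-n-Grams

An n-gram is a sequence of n tokens.
A 1-gram is a single word, also called a unigram.
The larger n is, the more information the n-grams can capture, but the cost grows accordingly: a vocabulary of k unique words can produce up to k^2 unique bigrams.

Example: computing n-grams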

>>> import pandas as pd
>>> import json
>>> from sklearn.feature_extraction.text import CountVectorizer
# Load the first 10,000 reviews
>>> f = open('data/yelp/v6/yelp_academic_dataset_review.json')
>>> js = []
>>> for i in range(10000):
...     js.append(json.loads(f.readline()))
>>> f.close()
>>> review_df = pd.DataFrame(js)
# Create feature transformers for unigrams, bigrams, and trigrams.
# By default, single-character words are ignored, which is useful in practice
# because it removes meaningless words. In this example, however,
# we explicitly include them for demonstration purposes.

>>> bow_converter = CountVectorizer(token_pattern='(?u)\\b\\w+\\b')
>>> bigram_converter = CountVectorizer(ngram_range=(2,2),
...                                    token_pattern='(?u)\\b\\w+\\b')
>>> trigram_converter = CountVectorizer(ngram_range=(3,3),
...                                     token_pattern='(?u)\\b\\w+\\b')
# Fit the transformers and look at the vocabulary sizes.
# Note: get_feature_names() was removed in scikit-learn 1.2;
# on newer versions use get_feature_names_out() instead.
>>> bow_converter.fit(review_df['text'])
>>> words = bow_converter.get_feature_names()
>>> bigram_converter.fit(review_df['text'])
>>> bigrams = bigram_converter.get_feature_names()
>>> trigram_converter.fit(review_df['text'])
>>> trigrams = trigram_converter.get_feature_names()
>>> print(len(words), len(bigrams), len(trigrams))
26047 346301 847545
# Take a look at the n-grams
>>> words[:10]
['0', '00', '000', '0002', '00am', '00ish', '00pm', '01', '01am', '02']
>>> bigrams[-10:]
['zucchinis at',
 'zucchinis took',
 'zucchinis we',
 'zuma over',
 'zuppa di',
 'zuppa toscana',
 'zuppe di',
 'zurich and',
 'zz top',
 'à la']
>>> trigrams[:10]
['0 10 definitely',
 '0 2 also',
 '0 25 per',
 '0 3 miles',
 '0 30 a',
 '0 30 everything',
 '0 30 lb',
 '0 35 tip',
 '0 5 curry',
 '0 5 pork']

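Number of unique n-grams in the first 10,000 reviews of the Yelp dataset: 26,047 unigrams, 346,301 bigrams, and 847,545 trigrams.

3.2 Filtering for Cleaner Features

3.2.1 Stopwords

Stopword lists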
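One way to apply a stopword list is sketched below, assuming scikit-learn's built-in English list is acceptable for the task. The toy sentences are made up for this example, and other libraries ship different lists.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Toy documents (hypothetical, for illustration only)
docs = ["the cat sat on the mat", "the dog is in the house"]

# stop_words="english" drops every token found in scikit-learn's built-in list
vectorizer = CountVectorizer(stop_words="english")
vectorizer.fit(docs)
kept_words = sorted(vectorizer.vocabulary_)

print(kept_words)                    # vocabulary after stopword removal
print("the" in ENGLISH_STOP_WORDS)   # the list itself can be inspected directly
```

A custom list can be passed instead by giving stop_words a Python list of words.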

3.2.2 Frequency-Based Filtering

Frequent words
Rare words
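Both kinds of filtering can be sketched with the document-frequency thresholds of CountVectorizer. The thresholds and the toy reviews below are arbitrary assumptions chosen to make the effect visible, not recommended values.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy reviews (hypothetical, for illustration only)
docs = [
    "the food was great",
    "the service was slow",
    "the pasta was great",
    "the staff",
]

# min_df=2   : drop rare words that appear in fewer than 2 documents
# max_df=0.8 : drop frequent words that appear in more than 80% of documents
vectorizer = CountVectorizer(min_df=2, max_df=0.8)
vectorizer.fit(docs)
filtered_vocab = sorted(vectorizer.vocabulary_)

print(filtered_vocab)
```

Here "the" is removed as too frequent and words such as "pasta" are removed as too rare, leaving only the mid-frequency words.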

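3.2.3 Stemming

Example: running the Porter stemmer from Python's NLTK package. It works well in many cases, but it is not perfect.
For instance, "goes" is mapped to "goe", while "go" is mapped to itself.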

>>> import nltk
>>> stemmer = nltk.stem.porter.PorterStemmer()
>>> stemmer.stem('flowers')
u'flower'
>>> stemmer.stem('zeroes')
u'zero'
>>> stemmer.stem('stemmer')
u'stem'
>>> stemmer.stem('sixties')
u'sixti'
>>> stemmer.stem('sixty')
u'sixty'
>>> stemmer.stem('goes')
u'goe'
>>> stemmer.stem('go')
u'go'

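Stemming is not always necessary.

3.3 Atoms of Meaning: From Words to n-Grams to Phrases

3.3.1 Parsing and Tokenization

Parsing
Semi-structured documents, such as JSON strings or HTML pages, need to be parsed first
If the document is a web page, the parser also needs to handle URLs
If it is an email, fields such as From, To, and Subject need special handling
Otherwise this information ends up counted like ordinary words in the final tally and loses its value
Tokenization
Whitespace
Punctuation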
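A minimal sketch of tokenizing on whitespace and punctuation is shown below. The function name simple_tokenize and the exact set of punctuation characters are assumptions for illustration; real tokenizers handle many more cases (URLs, hyphens, emoticons).

```python
import re


def simple_tokenize(text):
    """Split lowercased text on runs of whitespace and common punctuation."""
    pieces = re.split(r"[\s.,;:!?\"()]+", text.lower())
    # re.split can leave empty strings at the edges; drop them
    return [piece for piece in pieces if piece]


tokens = simple_tokenize("Hello, world! It's fine.")
print(tokens)
```

The apostrophe is deliberately left out of the delimiter set so that contractions such as "it's" stay in one piece.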

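3.3.2 Collocation Extraction for Phrase Detection

Frequency-based methods
Hypothesis testing for collocation extraction

The algorithm for detecting common phrases via likelihood ratio test analysis is as follows:
(1) Compute the occurrence probability of every word: P(w).
(2) For every unique bigram, compute the conditional probability of the word pair: P(w2 | w1).
(3) For every unique bigram, compute the likelihood ratio log λ.
(4) Sort the bigrams by their likelihood ratio.
(5) Take the bigrams with the smallest likelihood ratio values as features. (With λ defined as the likelihood under the independence hypothesis divided by the likelihood under the alternative, the smallest values mark the pairs least likely to be independent.)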
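The five steps can be sketched in plain Python as follows. This is a rough sketch under stated assumptions: the notes do not spell out the formula, so a binomial likelihood for each hypothesis is assumed here, and the helper names log_likelihood and bigram_log_lambda, as well as the toy sentence, are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(k, n, x):
    """Log of x**k * (1 - x)**(n - k), taking 0 * log(0) as 0."""
    total = 0.0
    if k > 0:
        total += k * math.log(x) if x > 0 else -math.inf
    if n - k > 0:
        total += (n - k) * math.log(1 - x) if x < 1 else -math.inf
    return total


def bigram_log_lambda(tokens):
    """Return {(w1, w2): log lambda}, where smaller means stronger association."""
    n_tokens = len(tokens)
    unigram = Counter(tokens)                  # step 1: word counts, P(w) = count / N
    bigram = Counter(zip(tokens, tokens[1:]))  # counts of adjacent word pairs
    scores = {}
    for (w1, w2), c12 in bigram.items():
        c1, c2 = unigram[w1], unigram[w2]
        rest = n_tokens - c1                   # tokens that are not w1
        if rest == 0:
            continue                           # only one distinct word, nothing to compare
        k_rest = min(c2 - c12, rest)           # w2 not preceded by w1 (clamped for w1 == w2)
        p = c2 / n_tokens                      # null hypothesis: P(w2) regardless of w1
        p1 = c12 / c1                          # step 2: P(w2 | w1)
        p2 = k_rest / rest                     # P(w2 | not w1)
        null = log_likelihood(c12, c1, p) + log_likelihood(k_rest, rest, p)
        alt = log_likelihood(c12, c1, p1) + log_likelihood(k_rest, rest, p2)
        scores[(w1, w2)] = null - alt          # step 3: log lambda
    return scores


# Toy text (hypothetical, for illustration only)
tokens = "i love new york and i love the city new york is big and the city is big".split()
scores = bigram_log_lambda(tokens)
# Steps 4 and 5: sort ascending and keep the bigrams with the smallest log lambda
top_bigrams = sorted(scores, key=scores.get)[:4]
print(top_bigrams)
```

On real data the cutoff (here the top 4) would be chosen by inspecting the sorted list or by a significance threshold rather than fixed in advance.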

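Chunking and part-of-speech tagging

Chunking is somewhat more sophisticated than finding n-grams: it uses a rule-based model to form sequences of tokens based on parts of speech.
To find these phrases, we first tokenize the text into words with their parts of speech, then examine each token's neighbors to find part-of-speech groupings, also called "chunks". The models that map words to parts of speech are generally language specific. Several open source Python libraries (such as NLTK, spaCy, and TextBlob) provide models for multiple languages.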
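To make the "rule-based model over parts of speech" concrete, here is a minimal sketch using NLTK's RegexpParser on tokens that are already tagged, so no tagger model is needed. The noun-phrase grammar and the hand-tagged tokens are assumptions chosen for illustration.

```python
import nltk

# Hand-tagged tokens (Penn Treebank tags), hypothetical input for illustration
tagged = [
    ("Dr.", "NNP"), ("Goldberg", "NNP"), ("is", "VBZ"), ("moving", "VBG"),
    ("to", "IN"), ("Arizona", "NNP"), ("to", "TO"), ("take", "VB"),
    ("a", "DT"), ("new", "JJ"), ("position", "NN"),
]

# Rule: a noun phrase is an optional determiner, any adjectives, then one or more nouns
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)

# Collect the text of every NP chunk found by the rule
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees()
    if subtree.label() == "NP"
]
print(noun_phrases)
```

Changing the grammar string changes which part-of-speech sequences count as a chunk, which is the main design lever in this approach.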
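Example: part-of-speech tagging and chunking

>>> import pandas as pd
>>> import json
# Load the first 10 reviews
>>> f = open('data/yelp/v6/yelp_academic_dataset_review.json')
>>> js = []
>>> for i in range(10):
...     js.append(json.loads(f.readline()))
>>> f.close()
>>> review_df = pd.DataFrame(js)
# First, use the functions in spaCy
>>> import spacy
# Preload the language model
# (the 'en' shortcut is from spaCy 2.x; spaCy 3+ needs a full model name such as 'en_core_web_sm')
>>> nlp = spacy.load('en')
# We can create a Pandas Series of spaCy Doc objects
>>> doc_df = review_df['text'].apply(nlp)
# spaCy provides coarse-grained parts of speech via (.pos_)
# and fine-grained parts of speech via (.tag_)
>>> for token in doc_df[4]:
...     print(token.text, token.pos_, token.tag_)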
Got VERB VBP
a DET DT
letter NOUN NN
in ADP IN
the DET DT
mail NOUN NN
last ADJ JJ
week NOUN NN
that ADJ WDT
said VERB VBD
Dr. PROPN NNP
Goldberg PROPN NNP
is VERB VBZ
moving VERB VBG
to ADP IN
Arizona PROPN NNP
to PART TO
take VERB VB
a DET DT
new ADJ JJ
position NOUN NN
there ADV RB
in ADP IN
June PROPN NNP
. PUNCT .
  SPACE SP
He PRON PRP
will VERB MD
be VERB VB
missed VERB VBN
very ADV RB
much ADV RB
. PUNCT .
SPACE SP
I PRON PRP
think VERB VBP
finding VERB VBG
a DET DT
new ADJ JJ
doctor NOUN NN
in ADP IN
NYC PROPN NNP
that ADP IN
you PRON PRP
actually ADV RB
like INTJ UH
might VERB MD
almost ADV RB
be VERB VB
as ADV RB
awful ADJ JJ
as ADP IN
trying VERB VBG
to PART TO
find VERB VB
a DET DT
date NOUN NN
! PUNCT .


# spaCy can also do basic noun chunking
>>> print([chunk for chunk in doc_df[4].noun_chunks])
[a letter, the mail, Dr. Goldberg, Arizona, a new position, June, He, I, a new doctor, NYC, you, a date]
#####
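# We can do the same feature transformation with TextBlob
>>> from textblob import TextBlob
# The default tagger in TextBlob uses PatternTagger, which is fine for this example.
# You can also specify the NLTK tagger, which works better for incomplete sentences.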
>>> blob_df = review_df['text'].apply(TextBlob)
>>> blob_df[4].tags
[('Got', 'NNP'),
('a', 'DT'),
('letter', 'NN'),
('in', 'IN'),
('the', 'DT'),
('mail', 'NN'),
('last', 'JJ'),
('week', 'NN'),
('that', 'WDT'),
('said', 'VBD'),
('Dr.', 'NNP'),
('Goldberg', 'NNP'),
('is', 'VBZ'),
('moving', 'VBG'),
('to', 'TO'),
('Arizona', 'NNP'),
('to', 'TO'),
('take', 'VB'),
('a', 'DT'),
('new', 'JJ'),
('position', 'NN'),
('there', 'RB'),
('in', 'IN'),
('June', 'NNP'),
('He', 'PRP'),
('will', 'MD'),
('be', 'VB'),
('missed', 'VBN'),
('very', 'RB'),
('much', 'JJ'),
('I', 'PRP'),
('think', 'VBP'),
('finding', 'VBG'),
('a', 'DT'),
('new', 'JJ'),
('doctor', 'NN'),
('in', 'IN'),
('NYC', 'NNP'),
('that', 'IN'),
('you', 'PRP'),
('actually', 'RB'),
('like', 'IN'),
('might', 'MD'),
('almost', 'RB'),
('be', 'VB'),
('as', 'RB'),
('awful', 'JJ'),
('as', 'IN'),
('trying', 'VBG'),
('to', 'TO'),
('find', 'VB'),
('a', 'DT'),
('date', 'NN')]
>>> print([np for np in blob_df[4].noun_phrases])
['got', 'goldberg', 'arizona', 'new position', 'june', 'new doctor', 'nyc']

Reference: 《精通特徵工程》 (Feature Engineering for Machine Learning), Alice Zheng and Amanda Casari
