Sklearn Hands-On Series: Classifying Text by Category

Original

2020-06-09 05:07


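1. Introduction

Last time we ran sentiment classification on text. This time we take on a slightly more involved task, category classification: predicting which product category each review belongs to, for example electronics or clothing.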

2. Data processing

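import json
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

class Category: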
    ELECTRONICS = "ELECTRONICS"
    BOOKS = "BOOKS"
    CLOTHING = "CLOTHING"
    GROCERY = "GROCERY"
    PATIO = "PATIO"
    
class Sentiment:
    POSITIVE = "POSITIVE"
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    
class Review:
    def __init__(self, category, text, score):
        self.category = category
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()
        
    def get_sentiment(self):
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else:
            return Sentiment.POSITIVE
    
class ReviewContainer:
    def __init__(self, reviews):
        self.reviews = reviews
        
    def get_text(self):
        return [x.text for x in self.reviews]
    
    # Convert the texts from get_text() into features with vectorizer.transform
    def get_x(self, vectorizer):
        return vectorizer.transform(self.get_text())   
    
    def get_y(self):
        return [x.sentiment for x in self.reviews]
    
    def get_category(self):
        return [x.category for x in self.reviews]
    
    def evenly_distribute(self):
        negative = list(filter(lambda x: x.sentiment == Sentiment.NEGATIVE, self.reviews))
        positive = list(filter(lambda x: x.sentiment == Sentiment.POSITIVE, self.reviews))
        positive_shrunk = positive[:len(negative)]
        #print(len(positive_shrunk))
        self.reviews = negative + positive_shrunk
        random.shuffle(self.reviews)
        #print(self.reviews[0])
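
The `evenly_distribute` method balances the two sentiment classes by truncating the positive reviews to the number of negative ones, which also drops every neutral review. A minimal standalone sketch of that idea follows. It uses hypothetical `(score, text)` tuples in place of `Review` objects, and the helper names `sentiment` and `balance` are illustrative only:

```python
import random

# Hypothetical toy data: (score, text) pairs standing in for Review objects.
samples = [(5, "great"), (4, "good"), (5, "love it"), (4, "nice"),
           (1, "broken"), (2, "poor"), (3, "okay")]

def sentiment(score):
    # Same thresholds as Review.get_sentiment.
    if score <= 2:
        return "NEGATIVE"
    if score == 3:
        return "NEUTRAL"
    return "POSITIVE"

def balance(items, seed=0):
    negative = [s for s in items if sentiment(s[0]) == "NEGATIVE"]
    positive = [s for s in items if sentiment(s[0]) == "POSITIVE"]
    # Keep only as many positives as there are negatives; neutrals are dropped.
    balanced = negative + positive[:len(negative)]
    random.Random(seed).shuffle(balanced)
    return balanced

balanced = balance(samples)
counts = {}
for score, _ in balanced:
    counts[sentiment(score)] = counts.get(sentiment(score), 0) + 1
print(counts)
```

Note that the slice only balances the classes when positives outnumber negatives. If there are fewer positive than negative reviews, every positive is kept and the classes stay unbalanced.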

Next, read the data from the five files into a single list:

file_names = ['category/Electronics_small.json', 'category/Books_small.json','category/Clothing_small.json',
              'category/Grocery_small.json', 'category/Patio_small.json']
file_categories = [Category.ELECTRONICS, Category.BOOKS, Category.CLOTHING, Category.GROCERY, Category.PATIO]

reviews = []
for file_name, category in zip(file_names, file_categories):
    with open(file_name) as f:
        for line in f:
            review_json = json.loads(line)   # parse one JSON object per line
            review = Review(category, review_json["reviewText"], review_json["overall"])
            reviews.append(review)
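
The loop reads each file line by line and decodes every line on its own, so the files are expected to hold one JSON object per line. A small sketch of that parsing step, assuming a made-up two-line sample held in memory (only the field names `reviewText` and `overall` come from the code above):

```python
import io
import json

# Hypothetical two-line sample in the shape the loading loop expects.
sample = io.StringIO(
    '{"reviewText": "Fast charger", "overall": 5.0}\n'
    '{"reviewText": "Fell apart", "overall": 1.0}\n'
)

records = []
for line in sample:
    line = line.strip()
    if not line:          # skip blank lines instead of failing in json.loads
        continue
    obj = json.loads(line)
    records.append((obj["reviewText"], obj["overall"]))

print(records)
```

The blank-line guard is an addition for robustness; the original loop assumes every line is valid JSON.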

Then split the data into training and test sets and extract the features and labels for each:

train, test = train_test_split(reviews, test_size=0.33, random_state=42)

train_container = ReviewContainer(train)   
test_container = ReviewContainer(test)
train_container.evenly_distribute()
test_container.evenly_distribute()

corpus = train_container.get_text()
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)   # fit on the training data only

train_x = train_container.get_x(vectorizer)
train_y = train_container.get_category()

test_x = test_container.get_x(vectorizer)
test_y = test_container.get_category()
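
The vectorizer is fitted on the training texts only and then reused to transform the test texts, so both matrices share the same columns. A toy sketch of that behaviour, with invented corpora; words that never appeared in training are simply ignored at transform time:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora.
train_texts = ["fast laptop battery", "soft cotton shirt", "laptop screen"]
test_texts = ["cotton laptop sleeve"]   # "sleeve" never appears in training

vectorizer = TfidfVectorizer()
train_x = vectorizer.fit_transform(train_texts)   # learn vocabulary and IDF weights
test_x = vectorizer.transform(test_texts)         # reuse them, no refitting

print(sorted(vectorizer.vocabulary_))
print(train_x.shape, test_x.shape)
```

Calling `fit` again on the test texts would produce a different vocabulary, and the classifier trained on the first one could no longer be applied.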

3. Building the models

3.1. Support vector machine

from sklearn.metrics import f1_score
from sklearn.svm import SVC

clf = SVC(C=16, kernel="linear", gamma="auto")   # gamma is ignored by the linear kernel
clf.fit(train_x, train_y)
print(clf.score(test_x, test_y))
print(f1_score(test_y, clf.predict(test_x),
               average=None, labels=[Category.ELECTRONICS,
                                     Category.BOOKS,Category.CLOTHING,Category.GROCERY,Category.PATIO]))
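
With `average=None`, `f1_score` returns one F1 value per entry in `labels`, in that order, rather than a single averaged number. The plain-Python sketch below shows what each of those values measures. `per_class_f1` is an illustrative helper, not part of scikit-learn, and the labels are toy data:

```python
def per_class_f1(y_true, y_pred, labels):
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); report 0.0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

y_true = ["BOOKS", "BOOKS", "CLOTHING", "CLOTHING", "PATIO"]
y_pred = ["BOOKS", "CLOTHING", "CLOTHING", "CLOTHING", "PATIO"]
scores = per_class_f1(y_true, y_pred, ["BOOKS", "CLOTHING", "PATIO"])
print(scores)
```

Per-class scores make it easy to see which categories the classifier confuses, which a single accuracy figure hides.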

3.2. Naive Bayes

from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
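# GaussianNB needs dense input; toarray() returns a plain ndarray,
# whereas todense() returns np.matrix, which recent scikit-learn versions reject.
gnb.fit(train_x.toarray(), train_y)
print(gnb.score(test_x.toarray(), test_y))
print(f1_score(test_y, gnb.predict(test_x.toarray()), average=None, labels=[Category.ELECTRONICS,
                                     Category.BOOKS,Category.CLOTHING,Category.GROCERY,Category.PATIO]))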
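
`GaussianNB` does not accept sparse matrices, so the TF-IDF output has to be converted to a dense array before fitting and before predicting. A toy sketch of that conversion, using invented texts and `toarray()` to get a plain NumPy array:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data.
texts = ["fast laptop", "laptop battery", "cotton shirt", "soft shirt"]
labels = ["ELECTRONICS", "ELECTRONICS", "CLOTHING", "CLOTHING"]

vec = TfidfVectorizer()
x_sparse = vec.fit_transform(texts)
x_dense = x_sparse.toarray()      # plain numpy.ndarray, one column per vocabulary word

gnb = GaussianNB()
gnb.fit(x_dense, labels)
pred = gnb.predict(vec.transform(["laptop"]).toarray())
print(type(x_dense).__name__, x_dense.shape, pred)
```

The dense array stores every zero explicitly, so memory use grows with the number of reviews times the vocabulary size. That is worth keeping in mind on larger corpora.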

4. Grid search for the best parameters

from sklearn.model_selection import GridSearchCV

parameters = {'kernel':("linear","rbf"), "C":[0.1,1,8,16,32]}
svc = SVC()
clf = GridSearchCV(svc, parameters, cv=5)
clf.fit(train_x, train_y)

print(clf.score(test_x, test_y))
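
After fitting, a `GridSearchCV` object also records which parameter combination won and how it scored during cross-validation. The sketch below shows how to read those attributes. It uses synthetic numeric data from `make_classification` as a stand-in for the TF-IDF features, which is an assumption made purely to keep the example self-contained:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the TF-IDF features (illustration only).
X, y = make_classification(n_samples=120, n_features=10, random_state=0)

parameters = {"kernel": ("linear", "rbf"), "C": [0.1, 1, 8]}
search = GridSearchCV(SVC(), parameters, cv=5)
search.fit(X, y)

print(search.best_params_)                 # winning kernel / C combination
print(round(search.best_score_, 3))        # its mean cross-validated accuracy
print(len(search.cv_results_["params"]))   # number of combinations tried
```

By default the search refits the best combination on the full training set, which is why the fitted search object can be used directly for `score` and `predict`.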

5. Saving and loading the model

import pickle

with open("category.pkl", "wb") as f:
    pickle.dump(clf, f)

with open("vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)

with open("category.pkl", "rb") as f:
    clf_loaded = pickle.load(f)
with open("vectorizer.pkl", "rb") as f:
    vectorizer = pickle.load(f)

test_set = ["very quick speeds", "loved the dress","bad phone"]
new_test = vectorizer.transform(test_set)

print(clf_loaded.predict(new_test))
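
The classifier and the vectorizer must always be saved and restored as a matching pair. One possible alternative, not used in the original code, is to bundle both into a scikit-learn `Pipeline` so that a single pickle holds everything. A sketch on invented toy data, pickling to memory rather than to a file:

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy data.
texts = ["fast laptop battery", "bright phone screen",
         "soft cotton dress", "warm wool sweater"]
labels = ["ELECTRONICS", "ELECTRONICS", "CLOTHING", "CLOTHING"]

model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
model.fit(texts, labels)          # the pipeline takes raw strings

blob = pickle.dumps(model)        # in-memory equivalent of pickle.dump to a file
restored = pickle.loads(blob)

queries = ["phone battery", "cotton sweater"]
before = list(model.predict(queries))
after = list(restored.predict(queries))
print(before, after)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.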

6. Inspecting results with a confusion matrix

from sklearn.metrics import confusion_matrix
import seaborn as sn
import pandas as pd

y_pred = clf.predict(test_x)
labels = [Category.ELECTRONICS,Category.BOOKS,Category.CLOTHING,Category.GROCERY,Category.PATIO]

cm = confusion_matrix(test_y, y_pred, labels= labels)
df_cm = pd.DataFrame(cm, index=labels, columns=labels)
sn.heatmap(df_cm, annot=True, fmt='d')
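
To read the heatmap it helps to know the orientation of the matrix: rows are the true labels and columns are the predicted labels, both in the order given by `labels`. A tiny sketch with invented labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels.
labels = ["BOOKS", "CLOTHING", "PATIO"]
y_true = ["BOOKS", "BOOKS", "CLOTHING", "CLOTHING", "PATIO"]
y_pred = ["BOOKS", "CLOTHING", "CLOTHING", "CLOTHING", "PATIO"]

cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
# Row i = true label, column j = predicted label:
# cm[0][1] counts BOOKS reviews that were predicted as CLOTHING.
```

Correct predictions sit on the diagonal, so off-diagonal cells show exactly which pairs of categories get mixed up.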
