Sklearn in Practice: Data Processing + Model Building + Grid Search + Saving (Loading) Models

Original

2020-05-25 23:54

1. Introduction

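Earlier posts walked through each of Sklearn's main modules with hands-on code. This post steps back and looks at Sklearn as a whole workflow. It is mostly code, with the explanations in the comments. Without further ado, let's hit the road! (Hold on tight.)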

Data link
Password: a6vy

2. Data processing

import random

class Sentiment:
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()   # derive the label from the score

    def get_sentiment(self):
        if self.score <= 2:
            return Sentiment.NEGATIVE       # class attribute of Sentiment
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else:
            return Sentiment.POSITIVE


class ReviewContainer:        # wraps the training set or the test set
    def __init__(self, reviews):
        self.reviews = reviews

    def get_text(self):
        return [x.text for x in self.reviews]     # collect every "text"

    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]  # collect every "sentiment"

    def evenly_distribute(self):        # balance the classes
        negative = list(filter(lambda x: x.sentiment == Sentiment.NEGATIVE, self.reviews))  # keep NEGATIVE
        positive = list(filter(lambda x: x.sentiment == Sentiment.POSITIVE, self.reviews))  # keep POSITIVE
        positive_shrunk = positive[:len(negative)]  # slice so positives match negatives (assumes positives are the larger class)
        self.reviews = negative + positive_shrunk   # final sample; NEUTRAL reviews are dropped
        random.shuffle(self.reviews)     # shuffle in place
# filter() filters a sequence: it drops the elements that fail the condition and returns an iterator, which list() converts to a list.
# It takes two arguments, a function and a sequence. Each element is passed to the function, which returns True or False, and only the elements that return True are kept.
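The slice in `evenly_distribute` balances the classes only when there are at least as many positive reviews as negative ones. As a minimal sketch (the `balance` helper, the seed, and the toy tuples are illustrative assumptions, not part of the original code), both classes can be trimmed to the size of the smaller one:

```python
import random

def balance(labelled, seed=0):
    """Trim both classes to the size of the smaller one, then shuffle."""
    negative = [item for item in labelled if item[0] == "NEGATIVE"]
    positive = [item for item in labelled if item[0] == "POSITIVE"]
    size = min(len(negative), len(positive))     # works whichever class is larger
    balanced = negative[:size] + positive[:size]
    random.Random(seed).shuffle(balanced)        # seeded so the order is reproducible
    return balanced

# Hypothetical (label, text) pairs: three negatives, one positive, one neutral.
toy = [("NEGATIVE", "dull"), ("NEGATIVE", "boring"), ("NEGATIVE", "bad"),
       ("POSITIVE", "great"), ("NEUTRAL", "okay")]
balanced = balance(toy)
labels = [label for label, _ in balanced]
print(labels.count("NEGATIVE"), labels.count("POSITIVE"))
```

The list comprehensions do the same job as `list(filter(lambda ...))` in the class above; which form to use is a matter of taste.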

Next, read the data and process it with the classes above:

import json

reviews = []     
with open("books_small_10000.json") as f:
    for line in f:
        review = json.loads(line)     # decode one JSON object per line
        reviews.append(Review(review["reviewText"], review["overall"]))

print(reviews[5].text)  # attribute access on a Review instance
print(reviews[5].score)
print(reviews[5].sentiment)
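The loader above treats the file as JSON Lines, with one JSON object per line. Here is a small sketch of the same parsing that uses an in-memory string in place of `books_small_10000.json`. The two sample records are made up; only the `reviewText` and `overall` keys come from the code above.

```python
import io
import json

# Stand-in for the real file: two hypothetical records, one per line.
sample = io.StringIO(
    '{"reviewText": "Loved it", "overall": 5.0}\n'
    '{"reviewText": "Not for me", "overall": 2.0}\n'
)

rows = []
for line in sample:
    line = line.strip()
    if not line:          # skip blank lines rather than fail in json.loads
        continue
    record = json.loads(line)
    rows.append((record["reviewText"], record["overall"]))

print(rows)
```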

Then split the data into a training set and a test set, and extract the features and labels of each:

from sklearn.model_selection import train_test_split

training, test = train_test_split(reviews, test_size=0.33, random_state=42) # split the data
train_container = ReviewContainer(training)  # wrap the training set
test_container = ReviewContainer(test)       # wrap the test set

train_container.evenly_distribute()     # balance the training set, then shuffle
train_x = train_container.get_text()    # training texts
train_y = train_container.get_sentiment()  # training labels

test_container.evenly_distribute()      # balance the test set, then shuffle
test_x = test_container.get_text()      # test texts
test_y = test_container.get_sentiment() # test labels

print(test_y.count(Sentiment.POSITIVE))
print(test_y.count(Sentiment.NEGATIVE))
# print(train_x_vectors[0])
# print(train_x_vectors[0].toarray())
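To make the effect of `test_size` and `random_state` concrete, here is a small sketch on a toy list. The integers are only a stand-in for the `Review` objects.

```python
from sklearn.model_selection import train_test_split

items = list(range(10))   # stand-in for the list of Review objects

train_a, test_a = train_test_split(items, test_size=0.33, random_state=42)
train_b, test_b = train_test_split(items, test_size=0.33, random_state=42)

print(len(train_a), len(test_a))   # sizes of the two parts
print(test_a == test_b)            # the same random_state reproduces the split
```

Fixing `random_state` keeps later comparisons between models fair, because every model sees the same held-out reviews.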

Finally, use TfidfVectorizer to turn the raw text into a tf-idf feature matrix:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vectorizer = TfidfVectorizer()
train_x_vectors = vectorizer.fit_transform(train_x) # fit_transform on the training data
test_x_vectors = vectorizer.transform(test_x)       # transform only on the test data
print(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
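The reason for calling `fit_transform` on the training text but only `transform` on the test text is that the vocabulary and IDF weights must come from the training data alone. A minimal sketch on a made-up three-document corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["good book", "bad book"]
test_docs = ["good story"]       # "story" never appears in the training data

vectorizer = TfidfVectorizer()
train_vecs = vectorizer.fit_transform(train_docs)  # learns the vocabulary and IDF weights
test_vecs = vectorizer.transform(test_docs)        # reuses them and learns nothing new

print(sorted(vectorizer.vocabulary_))     # words learned from the training docs
print(train_vecs.shape, test_vecs.shape)  # both have one column per learned word
print(test_vecs.toarray())                # the unseen word "story" contributes nothing
```

Both matrices share the same columns, which is what lets a classifier trained on one score the other.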

3. Model building

3.1. Support vector machine

from sklearn.svm import SVC
from sklearn.metrics import f1_score

clf_svm = SVC(kernel="linear")

clf_svm.fit(train_x_vectors, train_y)    # train the model
print(clf_svm.score(test_x_vectors, test_y))  # mean accuracy on the test data
print(clf_svm.predict(test_x_vectors[0]))   # predict the first test sample with the trained model
print(f1_score(test_y, clf_svm.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE]))
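With `average=None`, `f1_score` returns one F1 value per entry in `labels` instead of a single averaged number. The sketch below recomputes that per-label F1 by hand on made-up predictions. The helper `f1_for_label` is hypothetical and is meant to illustrate the definition, not to replace the library call.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 = 2 * precision * recall / (precision + recall) for one label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0                      # no correct hits for this label
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up ground truth and predictions for six reviews.
y_true = ["POSITIVE", "POSITIVE", "POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]
y_pred = ["POSITIVE", "POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"]

scores = [f1_for_label(y_true, y_pred, label) for label in ("POSITIVE", "NEGATIVE")]
print(scores)
```

Reporting the two values separately shows whether a model is weak on one class, which a single accuracy figure can hide.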

3.2. Decision tree

from sklearn.tree import DecisionTreeClassifier

clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, train_y)
print(clf_dec.score(test_x_vectors, test_y))
print(clf_dec.predict(test_x_vectors[0]))
print(f1_score(test_y, clf_dec.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE]))

3.3. Logistic regression

from sklearn.linear_model import LogisticRegression

clf_log = LogisticRegression()
clf_log.fit(train_x_vectors, train_y)
print(clf_log.score(test_x_vectors, test_y))
print(clf_log.predict(test_x_vectors[0]))
print(f1_score(test_y, clf_log.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE]))
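The three classifiers share the same `fit` / `score` / `predict` interface, so they can be compared in a loop. A sketch on a tiny made-up corpus follows. The texts, labels, and dictionary keys are assumptions for illustration, and scores on data this small say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

train_texts = ["great read", "loved this book", "wonderful story",
               "terrible plot", "boring and dull", "waste of time"]
train_labels = ["POSITIVE"] * 3 + ["NEGATIVE"] * 3
test_texts = ["great story", "dull plot"]
test_labels = ["POSITIVE", "NEGATIVE"]

vectorizer = TfidfVectorizer()
x_train = vectorizer.fit_transform(train_texts)
x_test = vectorizer.transform(test_texts)

models = {
    "svm": SVC(kernel="linear"),
    "tree": DecisionTreeClassifier(random_state=0),  # seeded for repeatable trees
    "logreg": LogisticRegression(),
}
results = {}
for name, model in models.items():
    model.fit(x_train, train_labels)
    results[name] = model.score(x_test, test_labels)   # mean accuracy
print(results)
```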

4. Grid search for the best parameters

from sklearn.model_selection import GridSearchCV

parameters = {'kernel':("linear","rbf"), "C":(1,4,8,16,32)}
svc = SVC()
clf = GridSearchCV(svc, parameters, cv=5)  # 5-fold cross-validation
clf.fit(train_x_vectors, train_y)

print(clf.score(test_x_vectors, test_y))
print(f1_score(test_y, clf.predict(test_x_vectors),average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE]))
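The grid above has 2 kernels times 5 values of `C`, so 10 combinations are each cross-validated before the best one is refit. The sketch below shows how to inspect what the search chose. It uses a made-up 12-document corpus, and `cv=3` instead of `cv=5` only because the toy set is so small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.svm import SVC

texts = ["great read", "loved this book", "wonderful story",
         "great characters", "loved the ending", "wonderful writing",
         "terrible plot", "boring and dull", "waste of time",
         "terrible ending", "boring characters", "dull writing"]
labels = ["POSITIVE"] * 6 + ["NEGATIVE"] * 6

parameters = {"kernel": ("linear", "rbf"), "C": (1, 4, 8, 16, 32)}
print(len(ParameterGrid(parameters)))   # number of combinations the search tries

x = TfidfVectorizer().fit_transform(texts)
search = GridSearchCV(SVC(), parameters, cv=3)   # cv=3 only because the toy set is tiny
search.fit(x, labels)

print(search.best_params_)   # best combination found by cross-validation
print(search.best_score_)    # its mean cross-validated accuracy
print(type(search.best_estimator_).__name__)  # model refit with the best parameters
```

After fitting, calling `predict` or `score` on the search object delegates to `best_estimator_`, which is why the article can use `clf` like an ordinary classifier.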

5. Saving and loading the model

Save the model:

import pickle

with open("sklearn.pkl","wb") as f:
    pickle.dump(clf, f)

Load the model:

with open("sklearn.pkl","rb") as f:
    loaded = pickle.load(f)

Predict with the loaded model:

print(test_x[0])
print(loaded.predict(test_x_vectors[0]))
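Note that the pickle above stores only the classifier. The fitted `vectorizer` is not saved, so a fresh process could not turn new raw text into features. One possible approach, sketched here on made-up data, is to bundle both steps in a scikit-learn `Pipeline` and pickle that instead. The article itself does not use `Pipeline`, so treat this as an assumption about how you might extend it.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

texts = ["great read", "loved this book", "terrible plot", "boring and dull"]
labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

# Vectorizer and classifier travel together as one object.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
model.fit(texts, labels)

blob = pickle.dumps(model)      # same serialization that pickle.dump writes to a file
restored = pickle.loads(blob)

new_texts = ["great book", "boring plot"]
print(restored.predict(new_texts))   # raw strings in, labels out
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code, and load the model with the same scikit-learn version that saved it where possible.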
