Linear Classifiers - Tumor Prediction

Original

cicilover

2020-02-23 15:37

Tumor prediction dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/

Note: in what follows, records with missing values are simply dropped for now.
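
As a rough illustration of that note, the following sketch drops every record that contains a `?` field. The rows are made up for the example (they only mimic the comma-separated layout of the dataset), and `drop_incomplete` is a hypothetical helper; the script in this post performs the same step with pandas.

```python
import csv
import io

# Made-up rows in the same comma-separated layout as the dataset (not real records).
SAMPLE = """1001,5,1,1,1,2,1,3,1,1,2
1002,8,4,5,1,2,?,7,3,1,4
1003,3,1,1,1,2,2,3,1,1,2
"""

def drop_incomplete(rows, missing="?"):
    """Keep only the rows in which no field equals the missing-value marker."""
    return [row for row in rows if missing not in row]

rows = list(csv.reader(io.StringIO(SAMPLE)))
complete = drop_incomplete(rows)
print(len(rows), len(complete))  # prints: 3 2
```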

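The script below classifies this dataset with LogisticRegression and with SGDClassifier (trained by stochastic gradient descent), then reports the prediction performance of both.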

Python source:

#coding=utf-8
import pandas as pd
import numpy as np
#-------------
#use train_test_split to split the data
#(sklearn.cross_validation was removed in scikit-learn 0.20; the function now lives in sklearn.model_selection)
from sklearn.model_selection import train_test_split
#-------------
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
#-------------
from sklearn.metrics import classification_report


#-------------download data
#create the list of column names
column_names=['Sample code number','Clump Thickness','Uniformity of Cell Size','Uniformity of Cell Shape','Marginal Adhesion','Single Epithelial Cell Size','Bare Nuclei','Bland Chromatin','Normal Nucleoli','Mitoses','Class']

#use the pandas.read_csv function to read the data from the internet
data=pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data',names=column_names)

#replace ? with the standard missing-value representation
data=data.replace(to_replace='?',value=np.nan)
#drop every record that has a missing value in one or more dimensions
data=data.dropna(how='any')
#'Bare Nuclei' was read as text because of the ? entries; convert it back to integers
data=data.astype({'Bare Nuclei':int})
#output the number of records and the number of dimensions
print(data.shape)

#-------------prepare training and testing data
#randomly select 25% of the data for testing and 75% for training
X_train,X_test,y_train,y_test=train_test_split(data[column_names[1:10]],data[column_names[10]],test_size=0.25,random_state=33)
#show the number of samples per class in the training data
print(y_train.value_counts())
#show the number of samples per class in the testing data
print(y_test.value_counts())
#-------------use linear classification models to make predictions
#standardize the data so that every dimension has mean 0 and variance 1; this keeps the result from being dominated by a dimension that happens to have large feature values
ss=StandardScaler()
X_train=ss.fit_transform(X_train)
X_test=ss.transform(X_test)

#initialize LogisticRegression and SGDClassifier
lr=LogisticRegression()
sgdc=SGDClassifier()

#use the fit function of LogisticRegression to train the model parameters
lr.fit(X_train,y_train)
#use the trained model lr to predict on X_test and store the result in lr_y_predict
lr_y_predict=lr.predict(X_test)

#use the fit function of SGDClassifier to train the model parameters
sgdc.fit(X_train,y_train)
#use the trained model sgdc to predict on X_test and store the result in sgdc_y_predict
sgdc_y_predict=sgdc.predict(X_test)

#-------------performance analysis
#use the score function provided by the LR model to get the accuracy
print('Accuracy of LR Classifier: {}'.format(lr.score(X_test,y_test)))
#get the other three metrics (precision, recall, f1-score)
print(classification_report(y_test,lr_y_predict,target_names=['Benign','Malignant']))

#use the score function provided by the SGD model to get the accuracy
print('Accuracy of SGD Classifier: {}'.format(sgdc.score(X_test,y_test)))
#get the other three metrics (precision, recall, f1-score)
print(classification_report(y_test,sgdc_y_predict,target_names=['Benign','Malignant']))
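
The standardization step in the script can be sketched without scikit-learn. This is an illustrative reimplementation for a single feature column with made-up numbers; `fit_scaler` and `apply_scaler` are hypothetical names. The point it shows is that the mean and standard deviation are computed from the training data only and then reused unchanged on the test data, which is what calling `fit_transform` on X_train and `transform` on X_test does.

```python
import math

def fit_scaler(values):
    """Return the mean and population standard deviation of one feature column."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)

def apply_scaler(values, mean, std):
    """Standardize values with statistics that were computed elsewhere."""
    return [(v - mean) / std for v in values]

train = [1.0, 3.0, 5.0, 7.0]  # made-up training values
test = [2.0, 10.0]            # made-up test values

mean, std = fit_scaler(train)                  # statistics from the training data only
train_scaled = apply_scaler(train, mean, std)  # mean 0, variance 1 by construction
test_scaled = apply_scaler(test, mean, std)    # same statistics reused, no refit

print(train_scaled)
print(test_scaled)
```

The scaled test values generally do not have mean 0 and variance 1 themselves, and that is expected: the test set is measured on the scale learned from the training set.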

Result:

(683, 11)
2    344
4    168
Name: Class, dtype: int64
2    100
4     71
Name: Class, dtype: int64
Accuracy of LR Classifier: 0.988304093567
             precision    recall  f1-score   support

     Benign       0.99      0.99      0.99       100
  Malignant       0.99      0.99      0.99        71

avg / total       0.99      0.99      0.99       171

Accuracy of SGD Classifier: 0.982456140351
             precision    recall  f1-score   support

     Benign       1.00      0.97      0.98       100
  Malignant       0.96      1.00      0.98        71

avg / total       0.98      0.98      0.98       171
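
The rounded figures in the two reports can be checked by hand. The raw counts below are not printed anywhere in the post; they are inferred from the accuracy values and the per-class supports (2 errors out of 171 for LR, 3 out of 171 for SGD), so treat them as an assumption. `report` is a hypothetical helper that recomputes precision, recall and F1 for one class.

```python
def report(tp, fp, fn):
    """Precision, recall and F1 for one class, rounded to two decimals."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 2), round(recall, 2), round(f1, 2)

# Inferred counts (assumption, see the text above).
# LR: 1 benign sample predicted malignant, 1 malignant sample predicted benign.
lr_benign = report(tp=99, fp=1, fn=1)
lr_malignant = report(tp=70, fp=1, fn=1)
# SGD: 3 benign samples predicted malignant, no malignant sample missed.
sgd_benign = report(tp=97, fp=0, fn=3)
sgd_malignant = report(tp=71, fp=3, fn=0)

lr_accuracy = (99 + 70) / 171
sgd_accuracy = (97 + 71) / 171

print(lr_benign, lr_malignant, lr_accuracy)
print(sgd_benign, sgd_malignant, sgd_accuracy)
```

Read this way, the SGD model in this run catches every malignant sample but raises three false alarms on benign ones, while the LR model makes one mistake in each direction.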

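LR vs. SGDClassifier: LogisticRegression computes its parameters with a more exact solver that works over the whole training set, so training takes longer but the resulting model performs slightly better (accuracy 0.988 versus 0.982 on the test set above). SGDClassifier estimates the model parameters by stochastic gradient descent, so training is faster but the resulting model performs slightly worse. In general, once the training data reaches the order of 100,000 samples or more, the time cost makes SGD the more advisable way to estimate the model parameters.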
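
To make the idea of stochastic updates concrete, here is a minimal sketch of stochastic gradient descent for a logistic-regression model in plain Python. It is not how SGDClassifier is implemented (by default that class optimizes a hinge loss, not the logistic loss); the function names, the learning rate, the epoch count and the toy data are all assumptions chosen for illustration. Each update looks at a single sample, so the cost of one step does not grow with the size of the training set.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_logistic(X, y, lr=0.1, epochs=200, seed=0):
    """Fit weights and a bias by updating on one sample at a time."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new random order each epoch
        for i in order:
            z = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            error = sigmoid(z) - y[i]  # gradient of the log loss with respect to z
            w = [wj - lr * error * xj for wj, xj in zip(w, X[i])]
            b -= lr * error
    return w, b

def predict(w, b, x):
    """Return 1 when the predicted probability is at least 0.5, else 0."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if z >= 0 else 0

# Toy data, already centered: the label is 1 when the single feature is positive.
X = [[-2.0], [-1.0], [1.0], [2.0]]
y = [0, 0, 1, 1]
w, b = sgd_logistic(X, y)
print([predict(w, b, x) for x in X])
```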
