A Summary of sklearn Details

Original

skyHdd

2020-06-29 15:06

1. Splitting the dataset

Random split

from sklearn.model_selection import train_test_split
# data: the dataset to be split
# random_state: random seed, so that every run produces the same split
# test_size: proportion of the data placed in the test set
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)
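As a quick check of what the call above does, here is a minimal sketch on toy data. The array `data` is a hypothetical stand-in (100 rows, 3 feature columns plus a label column) and is not from the original post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 3 feature columns plus 1 label column.
rng = np.random.default_rng(0)
data = np.hstack([rng.normal(size=(100, 3)), rng.integers(0, 2, size=(100, 1))])

# 20% of the rows go to the test set, the remaining 80% to the training set.
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)
print(train_set.shape, test_set.shape)

# Reusing the same random_state reproduces exactly the same split.
train_again, test_again = train_test_split(data, test_size=0.2, random_state=42)
print(np.array_equal(test_set, test_again))
```

Fixing `random_state` matters mostly for reproducibility: without it, each run would evaluate the model on a different test set.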

Stratified split

Commonly used for classification problems with imbalanced classes.

from sklearn.model_selection import StratifiedShuffleSplit
import numpy as np
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])
split = StratifiedShuffleSplit(n_splits=3, test_size=0.5, random_state=0)
print(split) 
for train_index, test_index in split.split(X, y):
    print('TRAIN:', train_index, 'TEST:', test_index)
    X_train, X_test = X[train_index],X[test_index]
    y_train, y_test = y[train_index],y[test_index]
    print(len(X_train),len(X_test))

# Output as listed in the original post:
# StratifiedShuffleSplit(n_splits=3, random_state=0, test_size=0.5, train_size=None)
# TRAIN: [1 2] TEST: [3 0]
# TRAIN: [0 2] TEST: [1 3]
# TRAIN: [0 2] TEST: [3 1]

from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits = 1,test_size = 0.2,random_state = 42)
# Stratified sampling on the label column (here the last column of data; the original note refers to mnist['target'])
for train_index,test_index in split.split(data,data[:,-1]):
    train_set = data[train_index,:]
    test_set = data[test_index,:]
    print(len(train_set),len(test_set))
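To see why stratification helps with imbalanced classes, the sketch below builds a hypothetical dataset with a 90/10 class ratio (the data is an assumption, not from the original post) and checks that both parts keep that ratio. It also shows the `stratify` argument of `train_test_split`, which gives a stratified split in a single call.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Hypothetical imbalanced data: 90 rows of class 0 and 10 rows of class 1,
# with the label stored in the last column, as in the snippet above.
features = np.arange(200, dtype=float).reshape(100, 2)
labels = np.array([0] * 90 + [1] * 10, dtype=float).reshape(100, 1)
data = np.hstack([features, labels])

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data[:, -1]):
    train_set = data[train_index, :]
    test_set = data[test_index, :]

# Share of class 1 in each part; stratification keeps it at the original level.
print(train_set[:, -1].mean(), test_set[:, -1].mean())

# train_test_split can do the same through its stratify argument.
train_alt, test_alt = train_test_split(
    data, test_size=0.2, random_state=42, stratify=data[:, -1]
)
print(int(test_alt[:, -1].sum()), "rows of class 1 in the test set")
```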

Common machine learning algorithms

Linear Regression

#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import linear_model
#Load Train and Test datasets
#Identify feature and response variable(s); values must be numeric NumPy arrays

# The names on the right-hand side are placeholders for your own arrays
x_train=input_variables_values_training_datasets
y_train=target_variables_values_training_datasets
x_test=input_variables_values_test_datasets

# Create linear regression object
linear = linear_model.LinearRegression()

# Train the model using the training sets and check score
linear.fit(x_train, y_train)
linear.score(x_train, y_train)

#Equation coefficient and Intercept
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)

#Predict Output
predicted= linear.predict(x_test)
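The snippet above uses placeholder names for the data. A minimal runnable sketch follows, assuming hypothetical noise-free data that follows y = 3x + 2, so the fitted coefficient and intercept can be checked by eye. For a regressor, `score` returns the R^2 coefficient of determination.

```python
import numpy as np
from sklearn import linear_model

# Hypothetical noise-free data following y = 3x + 2.
x_train = np.arange(10, dtype=float).reshape(-1, 1)  # 2-D: (n_samples, n_features)
y_train = 3.0 * x_train.ravel() + 2.0
x_test = np.array([[10.0], [11.0]])

linear = linear_model.LinearRegression()
linear.fit(x_train, y_train)

# score() returns R^2, here measured on the training data itself.
print('R^2 on training data:', linear.score(x_train, y_train))
print('Coefficient:', linear.coef_)
print('Intercept:', linear.intercept_)

predicted = linear.predict(x_test)
print('Predictions:', predicted)
```

Note that the feature array has to be two-dimensional, which is why the single feature is reshaped into a column.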

Logistic Regression

#Import Library
from sklearn.linear_model import LogisticRegression
# Assumes X (predictors) and y (target) for the training set and x_test (predictors) for the test set

# Create logistic regression object
model = LogisticRegression()

# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)

#Equation coefficient and Intercept
print('Coefficient: \n', model.coef_)
print('Intercept: \n', model.intercept_)

#Predict Output
predicted= model.predict(x_test)
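The snippet above scores the model on the same data it was trained on, which tends to be optimistic. Here is a sketch that scores on a held-out part instead and also reads class probabilities. The synthetic data from `make_classification` is an assumption for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical binary classification data with 4 features.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression()
model.fit(X_train, y_train)

# Accuracy on data the model has seen versus data it has not seen.
print('train accuracy:', model.score(X_train, y_train))
print('test accuracy:', model.score(x_test, y_test))

# predict_proba returns one column per class; each row sums to 1.
proba = model.predict_proba(x_test)
print(proba[:3])
```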

Decision Tree

#Import Library
#Import other necessary libraries like pandas, numpy...

from sklearn import tree
# Assumes X (predictors) and y (target) for the training set and x_test (predictors) for the test set

# Create tree object
# For classification; criterion may be 'gini' (the default) or 'entropy' (information gain)
model = tree.DecisionTreeClassifier(criterion='gini')

# For regression, use: model = tree.DecisionTreeRegressor()

# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)

#Predict Output
predicted= model.predict(x_test)
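A fitted tree can be inspected directly. The sketch below assumes the iris dataset bundled with scikit-learn (not used in the original post), switches the criterion to entropy, limits the depth with `max_depth`, and prints the learned rules with `tree.export_text`.

```python
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# max_depth caps how deep the tree may grow, a common guard against overfitting.
model = tree.DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)
model.fit(X, y)

# export_text renders the learned splits as indented plain-text rules.
rules = tree.export_text(model, feature_names=list(iris.feature_names))
print(rules)
print('depth:', model.get_depth())
```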

Support Vector Machine (SVM)


#Import Library
from sklearn import svm
# Assumes X (predictors) and y (target) for the training set and x_test (predictors) for the test set

# Create SVM classification object
# Many options are available; this is the simplest form for classification. See the original article for more detail.
model = svm.SVC()

# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)

#Predict Output
predicted= model.predict(x_test)
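SVMs are sensitive to feature scale, so a common pattern is to standardize the features first. The following is a sketch under assumptions: the iris dataset, an RBF kernel, and C=1.0 are choices made here for illustration and are not from the original post.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# The pipeline fits the scaler on the training data only and reuses it at predict time.
model = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0))
model.fit(X_train, y_train)

predicted = model.predict(x_test)
test_accuracy = model.score(x_test, y_test)
print('test accuracy:', test_accuracy)
```

Wrapping both steps in one pipeline keeps test-set statistics out of the scaler, which would otherwise leak information into training.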

Naive Bayes

#Import Library
from sklearn.naive_bayes import GaussianNB
# Assumes X (predictors) and y (target) for the training set and x_test (predictors) for the test set

# Create Gaussian Naive Bayes object
# Other variants exist for other feature distributions, such as multinomial or Bernoulli Naive Bayes
model = GaussianNB()

# Train the model using the training sets and check score
model.fit(X, y)

#Predict Output
predicted= model.predict(x_test)
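Which Naive Bayes variant fits depends on the feature type. Here is a small sketch, assuming hypothetical two-class data generated below: `GaussianNB` for continuous features and `BernoulliNB` for the same features reduced to 0/1 values. The 1.5 threshold is arbitrary.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB

# Hypothetical continuous features: two classes drawn around different means.
rng = np.random.default_rng(0)
X_cont = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 2)),
    rng.normal(3.0, 1.0, size=(50, 2)),
])
y = np.array([0] * 50 + [1] * 50)

# GaussianNB models each feature as normally distributed within a class.
gnb = GaussianNB().fit(X_cont, y)

# BernoulliNB expects binary features; here they come from an arbitrary threshold.
X_bin = (X_cont > 1.5).astype(int)
bnb = BernoulliNB().fit(X_bin, y)

print('GaussianNB :', gnb.predict(X_cont[:3]))
print('BernoulliNB:', bnb.predict(X_bin[:3]))
```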

KNN (K-Nearest Neighbors)

#Import Library
from sklearn.neighbors import KNeighborsClassifier

# Assumes X (predictors) and y (target) for the training set and x_test (predictors) for the test set

# Create KNeighbors classifier object
model = KNeighborsClassifier(n_neighbors=6) # default value for n_neighbors is 5

# Train the model using the training sets and check score
model.fit(X, y)

#Predict Output
predicted= model.predict(x_test)
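The value of `n_neighbors` is usually chosen by validation rather than fixed up front. One way to do that is sketched below, assuming the iris dataset and a small hand-picked list of candidate values (both are illustrative choices): each candidate is compared by 5-fold cross-validated accuracy.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Mean 5-fold cross-validated accuracy for each candidate k.
scores_by_k = {}
for k in (1, 3, 5, 7):
    model = KNeighborsClassifier(n_neighbors=k)
    scores_by_k[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores_by_k, key=scores_by_k.get)
for k, score in scores_by_k.items():
    print(f'k={k}: {score:.3f}')
print('best k:', best_k)
```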

K-Means

#Import Library
from sklearn.cluster import KMeans

# Assumes X (attributes) for the training set and x_test (attributes) for the test set
# Create KMeans clustering object
k_means = KMeans(n_clusters=3, random_state=0)

# Fit the model on the training set (clustering is unsupervised, so no y is passed)
k_means.fit(X)

#Predict Output
predicted= k_means.predict(x_test)
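After fitting, a KMeans model exposes the cluster centers, the label assigned to each training point, and the within-cluster sum of squares. Below is a sketch on hypothetical blob data from `make_blobs`. The explicit `n_init=10` is a choice made here so that the number of restarts does not depend on the installed version.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical 2-D data drawn around 3 centers.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)
x_test = np.array([[0.0, 0.0], [2.0, 2.0]])

k_means = KMeans(n_clusters=3, n_init=10, random_state=0)
k_means.fit(X)

print('centers:\n', k_means.cluster_centers_)
print('first labels:', k_means.labels_[:10])
print('inertia:', k_means.inertia_)  # within-cluster sum of squared distances

# predict assigns each new point to its nearest learned center.
predicted = k_means.predict(x_test)
print('predicted clusters:', predicted)
```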

Random Forest

#Import Library
from sklearn.ensemble import RandomForestClassifier
# Assumes X (predictors) and y (target) for the training set and x_test (predictors) for the test set

# Create Random Forest object
model= RandomForestClassifier()

# Train the model using the training sets and check score
model.fit(X, y)

#Predict Output
predicted= model.predict(x_test)
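A fitted random forest also reports how much each feature contributed to its splits. The sketch below assumes the iris dataset and 100 trees, both chosen here for illustration, and reads the `feature_importances_` attribute.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X, y = iris.data, iris.target

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances, one value per feature, normalized to sum to 1.
importances = model.feature_importances_
for name, value in zip(iris.feature_names, importances):
    print(f'{name}: {value:.3f}')
```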

Dimensionality Reduction Algorithms

#Import Library
from sklearn import decomposition
# Assumes the training and test sets are named train and test
# Create PCA object; k (the number of components to keep) must be defined beforehand
# If n_components is omitted, min(n_samples, n_features) components are kept
pca = decomposition.PCA(n_components=k)
# For Factor analysis
#fa= decomposition.FactorAnalysis()

# Reduce the dimension of the training dataset using PCA
train_reduced = pca.fit_transform(train)

# Reduce the dimension of the test dataset with the same fitted PCA
test_reduced = pca.transform(test)
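The key point in the snippet above is that PCA is fitted on the training set only and then reused on the test set. A runnable sketch follows, assuming the four iris features and `n_components=2` (illustrative choices). It also prints how much variance each kept component explains.

```python
from sklearn import decomposition
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, _ = load_iris(return_X_y=True)
train, test = train_test_split(X, test_size=0.2, random_state=0)

pca = decomposition.PCA(n_components=2)

# Learn the projection from the training data, then apply it unchanged to the test data.
train_reduced = pca.fit_transform(train)
test_reduced = pca.transform(test)

print(train_reduced.shape, test_reduced.shape)
# Fraction of the total variance captured by each component, in decreasing order.
print(pca.explained_variance_ratio_)
```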

Gradient Boosting and AdaBoost

#Import Library
from sklearn.ensemble import GradientBoostingClassifier
# Assumes X (predictors) and y (target) for the training set and x_test (predictors) for the test set
# Create Gradient Boosting Classifier object
model= GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)

# Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted= model.predict(x_test)
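The heading mentions AdaBoost, but the snippet above only covers gradient boosting. For completeness, here is a minimal AdaBoost sketch following the same fit/predict pattern. The synthetic data and the parameter values are assumptions for illustration, not taken from the original article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Hypothetical binary classification data with 6 features.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# By default each weak learner is a depth-1 decision tree (a "stump").
model = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0)
model.fit(X_train, y_train)

predicted = model.predict(x_test)
print('test accuracy:', model.score(x_test, y_test))
```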

Appendix: original English article: http://www.analyticsvidhya.com/blog/2015/08/common-machine-learning-algorithms/
