Classification Algorithms

The purpose of this post is to introduce the basics of classification and to show the seven most commonly used algorithms with Python code: Logistic Regression, Naïve Bayes, Stochastic Gradient Descent, K-Nearest Neighbours, Decision Tree, Random Forest, and Support Vector Machine.

Introduction

Classification is a technique where we categorize data into a given number of classes. The main goal of a classification problem is to identify the category/class under which new data will fall. The terminology encountered in classification is as follows:

  • Classifier: An algorithm that maps the input data to a specific category.
  • Classification model: A classification model tries to draw conclusions from the input values given for training; it predicts the class labels/categories for new data.
  • Feature: A feature is an individual measurable property of a phenomenon being observed.
  • Binary Classification: Classification task with two possible outcomes. Eg: Gender classification (Male / Female)
  • Multi-class classification: Classification task with more than two classes. In multi-class classification, each sample is assigned to one and only one target label. Eg: An animal can be a cat or a dog but not both at the same time.
  • Multi-label classification: Classification task where each sample is mapped to a set of target labels (more than one class). Eg: A news article can be about sports, a person, and a location at the same time (a minimal sketch follows this list).
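
To make the multi-label case concrete, here is a minimal sketch of how such targets can be encoded with scikit-learn's MultiLabelBinarizer (the tag names are hypothetical):

from sklearn.preprocessing import MultiLabelBinarizer
# each sample carries a set of tags; the binarizer turns them into one indicator column per label
mlb = MultiLabelBinarizer()
y = mlb.fit_transform([['sports'], ['sports', 'person'], ['location']])
print(mlb.classes_)  # ['location' 'person' 'sports']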

The following are the steps involved in building a classification model:

  • Initialize: Set the parameters of the chosen classification algorithm.
  • Train the classifier: All classifiers in scikit-learn use a fit(X, y) method to fit (train) the model on the given training data X and training labels y.
  • Predict the target: Given an unlabeled observation X, predict(X) returns the predicted label y.
  • Evaluate the classifier model, for example with an accuracy score (see the sketch after this list).
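
Putting the four steps together, here is a minimal end-to-end sketch. It assumes a feature matrix X and label vector y are already loaded; LogisticRegression stands in for any of the classifiers below:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# hold out 30% of the data for testing
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
clf = LogisticRegression()             # initialize
clf.fit(x_train, y_train)              # train the classifier
y_pred = clf.predict(x_test)           # predict the target
print(accuracy_score(y_test, y_pred))  # evaluate the classifier model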

Types of Classification Algorithms (Python)

Logistic Regression

Logistic regression is a machine learning algorithm for classification in which the probabilities describing the possible outcomes of a single trial are modelled using a logistic function.

Logistic regression is most useful for understanding the influence of several independent variables on a single outcome variable. However, it only works when the predicted variable is binary, and it assumes that all predictors are independent of each other and that the data is free of missing values.

from sklearn.linear_model import LogisticRegression
# fit a logistic-regression classifier, then predict labels for the test set
lr = LogisticRegression()
lr.fit(x_train, y_train)
y_pred = lr.predict(x_test)
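
Since logistic regression models class probabilities directly, predict_proba can be used to inspect them alongside the hard predictions:

# per-class probabilities for each test sample, columns ordered as lr.classes_
probs = lr.predict_proba(x_test)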

Naive Bayes

Definition: The Naive Bayes algorithm is based on Bayes' theorem, with the assumption of independence between every pair of features. Naive Bayes classifiers work well in many real-world situations such as document classification and spam filtering.

This algorithm requires only a small amount of training data to estimate the necessary parameters and is extremely fast compared to more sophisticated methods. However, it is known to be a bad estimator.

from sklearn.naive_bayes import GaussianNB
# Gaussian Naive Bayes assumes the features are normally distributed within each class
nb = GaussianNB()
nb.fit(x_train, y_train)
y_pred = nb.predict(x_test)

Stochastic Gradient Descent

Definition: Stochastic gradient descent is a simple and very efficient approach to fit linear models. It is particularly useful when the number of samples is very large. It supports different loss functions and penalties for classification.

Its virtues are efficiency and ease of implementation. However, it requires a number of hyper-parameters and is sensitive to feature scaling (one common remedy is shown after the code).

from sklearn.linear_model import SGDClassifier
# modified_huber is a smooth loss that tolerates outliers and supports probability estimates
sgd = SGDClassifier(loss='modified_huber', shuffle=True, random_state=101)
sgd.fit(x_train, y_train)
y_pred = sgd.predict(x_test)
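
Because of the sensitivity to feature scaling, it is common to standardize the features before fitting. A minimal sketch using a scikit-learn Pipeline (same training data as above):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
# scale features to zero mean and unit variance, then fit SGD on the scaled data
pipe = make_pipeline(StandardScaler(), SGDClassifier(loss='modified_huber', random_state=101))
pipe.fit(x_train, y_train)
y_pred = pipe.predict(x_test)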

K-Nearest Neighbours

Neighbours-based classification is a type of lazy learning, as it does not attempt to construct a general internal model but simply stores instances of the training data. Classification is computed from a simple majority vote of the k nearest neighbours of each point.

Its advantages are that it is simple to implement, robust to noisy training data, and effective if the training data is large. However, the value of K needs to be determined, and the computation cost is high because the distance of each instance to all the training samples must be computed. (One common way to choose K is shown after the code.)

from sklearn.neighbors import KNeighborsClassifier
# classify each test point by majority vote among its 15 nearest training points
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
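
One common way to determine K is a cross-validated grid search over candidate values; a minimal sketch (the candidate range is an arbitrary choice):

from sklearn.model_selection import GridSearchCV
# try odd values of K with 5-fold cross-validation and keep the best
grid = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': list(range(1, 31, 2))}, cv=5)
grid.fit(x_train, y_train)
print(grid.best_params_)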

Decision Tree

Definition: Given data of attributes together with its classes, a decision tree produces a sequence of rules that can be used to classify the data.

Its advantages are that it is simple to understand and visualise, requires little data preparation, and can handle both numerical and categorical data. However, it can create complex trees that do not generalise well, and decision trees can be unstable because small variations in the data might result in a completely different tree being generated.

from sklearn.tree import DecisionTreeClassifier
# cap the depth and leaf size to keep the tree from growing overly complex
dtree = DecisionTreeClassifier(max_depth=10, random_state=101, max_features=None, min_samples_leaf=15)
dtree.fit(x_train, y_train)
y_pred = dtree.predict(x_test)
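
Because decision trees are easy to visualise, the learned rules can be printed directly; a minimal sketch using export_text (feature names are omitted here, so generic ones appear):

from sklearn.tree import export_text
# dump the fitted tree's if/else rules as plain text
print(export_text(dtree))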

Random Forest

A random forest classifier is a meta-estimator that fits a number of decision trees on various sub-samples of the dataset. The sub-sample size is always the same as the original input sample size, but the samples are drawn with replacement.

Its advantages are that it reduces over-fitting and is more accurate than a single decision tree in most cases. Its disadvantages are slow real-time prediction and being a complex algorithm that is difficult to implement.

from sklearn.ensemble import RandomForestClassifier
# build 70 trees; oob_score=True keeps an out-of-bag accuracy estimate during training
rfm = RandomForestClassifier(n_estimators=70, oob_score=True, n_jobs=-1, random_state=101, max_features=None, min_samples_leaf=30)
rfm.fit(x_train, y_train)
y_pred = rfm.predict(x_test)
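
Since oob_score=True was passed, the fitted forest also carries a built-in accuracy estimate computed on the samples each tree never saw, which avoids holding out a separate validation set:

# out-of-bag accuracy estimate and per-feature importances of the fitted forest
print(rfm.oob_score_)
print(rfm.feature_importances_)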

Support Vector Machine

A support vector machine represents the training data as points in space, separated into categories by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.

from sklearn.svm import SVC
# a linear kernel with a small C yields a wide, soft margin between the classes
svm = SVC(kernel="linear", C=0.025, random_state=101)
svm.fit(x_train, y_train)
y_pred = svm.predict(x_test)