Anomaly Detection in Practice


上进的菜鸟

2020-06-23 20:09

Introduction to Data Science: Python implementation

1. Univariate anomaly detection (examining one variable at a time)

1.1 Z-scores: flag observations whose absolute score exceeds 3

1.2 Box plots

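import numpy as np
from sklearn import preprocessing
# Assumes `boston` is already loaded (e.g. load_boston() in scikit-learn < 1.2).
# Column 3 (CHAS) is binary, so it is left out of the continuous variables.
continuous_variables = [n for n in range(boston.data.shape[1]) if n != 3]
normalized_data = preprocessing.StandardScaler().fit_transform(boston.data[:, continuous_variables])
# Row and column indices of every standardized value with |z| > 3
outlier_rows, outlier_columns = np.where(np.abs(normalized_data) > 3)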
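The box-plot rule from 1.2 can be sketched the same way. This is a minimal illustration, assuming the usual 1.5 × IQR whisker convention; the helper name `iqr_outliers` and the sample values are made up for the example.

```python
import numpy as np

def iqr_outliers(values, whisker=1.5):
    """Return a boolean mask marking points outside the box-plot whiskers."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - whisker * iqr
    upper = q3 + whisker * iqr
    return (values < lower) | (values > upper)

# Hypothetical sample: nine ordinary readings and one extreme value
sample = np.array([10, 12, 11, 13, 12, 11, 10, 14, 12, 40])
mask = iqr_outliers(sample)
flagged = sample[mask]
print(flagged)
```

Unlike the z-score, this rule relies on quartiles, so a single extreme value does not inflate the threshold it is compared against.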

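Univariate methods cannot detect outliers that are not extreme values. An observation can be anomalous because two or more variables take an unusual combination of values, and in that case each individual value is quite likely not extreme, so a variable-by-variable check misses it. Multivariate detection was developed to cover this case.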
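A small sketch of the situation described above, on synthetic data (the numbers are invented for illustration). Two variables are strongly correlated, and one added point breaks the correlation without being extreme on either axis. The Mahalanobis distance is used here as one common multivariate measure; it is an assumption of this example, not something the notes prescribe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two strongly correlated variables (synthetic)
x = rng.normal(size=500)
y = x + rng.normal(scale=0.1, size=500)
data = np.column_stack([x, y])

# One point whose coordinates are ordinary taken separately,
# but whose combination contradicts the correlation
data = np.vstack([data, [1.5, -1.5]])

# Univariate check: z-score of each coordinate of the added point
z = (data - data.mean(axis=0)) / data.std(axis=0)
univariate_flag = np.abs(z[-1]) > 3

# Multivariate check: squared Mahalanobis distance of every row
centered = data - data.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(data, rowvar=False))
d2 = np.einsum('ij,jk,ik->i', centered, inv_cov, centered)

print(univariate_flag, int(np.argmax(d2)))
```

Neither coordinate of the added point exceeds the |z| > 3 threshold, yet its Mahalanobis distance is the largest in the dataset.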

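2. Multivariate anomaly detection (considering several variables at once)

2.1 The covariance.EllipticEnvelope class:

Assuming the whole dataset can be described by an underlying multivariate Gaussian distribution, EllipticEnvelope is an estimator that tries to work out the key parameters of the data's overall distribution. It then checks how far each observation lies from the grand mean. Because the grand mean takes every variable in the dataset into account, the algorithm can find both univariate and multivariate outliers.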

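import numpy as np
from sklearn.covariance import EllipticEnvelope

# `data` is assumed to be a 2-D array of shape (n_samples, n_features)
# contamination: the expected proportion of outliers in the data
robust_covariance_est = EllipticEnvelope(contamination=0.05, store_precision=False, assume_centered=False)
robust_covariance_est.fit(data)
detection = robust_covariance_est.predict(data)  # -1 for outliers, 1 for inliers
outliers = np.where(detection == -1)
regular = np.where(detection == 1)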

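Drawback: when the data come from several distributions, the algorithm still tries to fit them to a single overall distribution. It then tends to look for potential outliers in the most remote cluster and overlooks other regions of the data that may also be affected by outliers.

Note: it works somewhat better if the data are standardized first.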
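Putting the note into practice, here is a self-contained sketch that standardizes first and then fits EllipticEnvelope. The data are synthetic: 300 points from one Gaussian plus three planted outliers, all invented for this example.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic inliers from a single correlated Gaussian
inliers = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=300)
# Three planted outliers, appended as rows 300, 301 and 302
planted = np.array([[6.0, -6.0], [-7.0, 7.0], [8.0, 8.0]])
data = np.vstack([inliers, planted])

# Standardize, then fit the robust covariance estimator
scaled = StandardScaler().fit_transform(data)
detector = EllipticEnvelope(contamination=0.05, random_state=0)
detection = detector.fit_predict(scaled)  # -1 for outliers, 1 for inliers

outliers = np.where(detection == -1)[0]
print(outliers)
```

Since `contamination=0.05` asks for roughly 5% of the rows to be flagged, the result contains the planted points plus the most peripheral ordinary points. The parameter sets how many rows get flagged; it does not determine which points are truly anomalous.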

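2.2 The svm.OneClassSVM class:

Main parameters: kernel (default rbf), degree (default 3, used only by the polynomial kernel), gamma, and nu. nu decides whether the model should fit the observed data closely or keep a more regular shape without adapting too much to the data at hand (choose the latter if outliers are present).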
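To get a feel for nu, the sketch below fits OneClassSVM on the same synthetic data with three values of nu and records the share of training points labelled -1. scikit-learn documents nu as an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors. The data and the chosen values are assumptions made for this example.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))  # synthetic data, no planted outliers

flagged_fraction = {}
for nu in (0.05, 0.2, 0.5):
    model = OneClassSVM(kernel='rbf', gamma='scale', nu=nu).fit(X)
    # Share of training points that the fitted model labels as outliers
    flagged_fraction[nu] = float(np.mean(model.predict(X) == -1))

print(flagged_fraction)
```

The flagged share grows with nu, so nu is best set from a rough prior guess of how contaminated the data are.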

import numpy as np
from sklearn.decomposition import PCA
from sklearn import preprocessing
from sklearn import svm
# Standardize (assumes `boston` is already loaded; column 3, CHAS, is binary and is skipped)
continuous_variables = [n for n in range(boston.data.shape[1]) if n != 3]
normalized_data = preprocessing.StandardScaler().fit_transform(boston.data[:, continuous_variables])
# PCA, used here only to project the data for plotting
pca = PCA(n_components=5)
Zscore_components = pca.fit_transform(normalized_data)
vtot = 'PCA variance explained ' + str(round(np.sum(pca.explained_variance_ratio_), 3))
# One-class SVM, fitted on the standardized data
outliers_fraction = 0.02
nu_estimate = 0.95 * outliers_fraction + 0.05
machine_learning = svm.OneClassSVM(kernel='rbf', gamma=1.0 / len(normalized_data), degree=3, nu=nu_estimate)
machine_learning.fit(normalized_data)
detection = machine_learning.predict(normalized_data)
outliers = np.where(detection == -1)[0]
regular = np.where(detection == 1)[0]
# Visualization: first component plotted against components 2 to 5
from matplotlib import pyplot as plt
for r in range(1, 5):
    in_points = plt.scatter(Zscore_components[regular, 0], Zscore_components[regular, r], c='blue', alpha=0.8, s=60, marker='o', edgecolor='white')
    out_points = plt.scatter(Zscore_components[outliers, 0], Zscore_components[outliers, r], c='red', alpha=0.8, s=60, marker='o', edgecolor='white')
    plt.legend((in_points, out_points), ('inliers', 'outliers'), scatterpoints=1, loc='best')
    plt.xlabel('Component 1 (' + str(round(pca.explained_variance_ratio_[0], 3)) + ')')
    plt.ylabel('Component ' + str(r + 1) + ' (' + str(round(pca.explained_variance_ratio_[r], 3)) + ')')
    plt.xlim([-7, 7])
    plt.ylim([-6, 6])
    plt.title(vtot)
    plt.show()

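Of course, besides PCA there are also RandomizedPCA and FactorAnalysis, which are better suited to large datasets. (RandomizedPCA was removed in scikit-learn 0.20; PCA(svd_solver='randomized') takes its place.)

KernelPCA maps the signal into a nonlinear space.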
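As a rough sketch of how these alternatives are called in current scikit-learn: the random data and the choice of three components are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # synthetic data standing in for a real dataset

# Approximate PCA via randomized SVD, the replacement for RandomizedPCA
linear = PCA(n_components=3, svd_solver='randomized', random_state=0).fit_transform(X)

# Kernel PCA with an RBF kernel: a nonlinear projection
nonlinear = KernelPCA(n_components=3, kernel='rbf').fit_transform(X)

print(linear.shape, nonlinear.shape)
```

Either projection can stand in for `Zscore_components` in the plotting loop above, although KernelPCA does not expose `explained_variance_ratio_`, so the axis labels would need adjusting.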

3. DBSCAN

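import numpy as np
from sklearn.cluster import DBSCAN
# `dataset_2` is assumed to be a 2-D array of shape (n_samples, n_features)
dbs_2 = DBSCAN(eps=0.5)
labels_2 = dbs_2.fit(dataset_2).labels_
np.unique(labels_2)
# If -1 appears in the result, it is the label assigned to the outliers (noise points)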
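A self-contained version of the same idea on synthetic data: two tight clusters plus two isolated points, all invented for this example. `min_samples=5` is the scikit-learn default, written out here for clarity.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)

# Two dense synthetic clusters of 100 points each
cluster_a = rng.normal(loc=[0, 0], scale=0.2, size=(100, 2))
cluster_b = rng.normal(loc=[5, 5], scale=0.2, size=(100, 2))
# Two isolated points, appended as rows 200 and 201
isolated = np.array([[2.5, 2.5], [10.0, -3.0]])
dataset = np.vstack([cluster_a, cluster_b, isolated])

labels = DBSCAN(eps=0.5, min_samples=5).fit(dataset).labels_

# Rows labelled -1 belong to no cluster and are treated as outliers
noise_idx = np.where(labels == -1)[0]
print(np.unique(labels), noise_idx)
```

The result depends heavily on eps: too small a value turns sparse but legitimate regions into noise, while too large a value absorbs real outliers into a cluster.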
