R: Outlier Detection

A custom function based on the boxplot() principle

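outlier.IQR <- function(x, multiple = 1.5, replace = FALSE, revalue = NA) {
  # Quartiles of x: q[2] is the first quartile, q[4] the third
  q <- quantile(x, na.rm = TRUE)
  # Values more than `multiple` (default 1.5) IQRs outside the quartiles are treated as outliers
  iqr <- q[4] - q[2]
  x1 <- which(x < q[2] - multiple * iqr | x > q[4] + multiple * iqr)
  x2 <- x[x1]
  if (length(x2) > 0) {
    outlier <- data.frame(location = x1, value = x2)
  } else {
    # No outliers found: return a placeholder row
    outlier <- data.frame(location = 0, value = 0)
  }
  # Optionally overwrite the outliers with `revalue`
  if (replace) {
    x[x1] <- revalue
  }
  return(list(new.value = x, outlier = outlier))
}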

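The result is a list with two components, outlier.IQR()$new.value and outlier.IQR()$outlier. The first is the vector after the outliers have been replaced (unchanged when replace = FALSE); the second holds the outliers of the original vector together with their positions.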
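As a quick check of the rule the function applies, the fences can be computed by hand on a toy vector and compared with boxplot.stats(). This is only a sketch: the vector x and the names lower, upper, by_hand and by_boxplot are made up for illustration. Note that boxplot.stats() works from hinges (fivenum()) rather than quantile(), so the two approaches can differ slightly for some sample sizes; they agree on this example.

```r
# Toy data (hypothetical): ten regular values plus one obvious outlier
x <- c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 40)

# Quartiles and interquartile range, as in outlier.IQR()
q <- quantile(x, na.rm = TRUE)
iqr <- unname(q[4] - q[2])

# Fences at 1.5 * IQR beyond the first and third quartiles
lower <- unname(q[2]) - 1.5 * iqr
upper <- unname(q[4]) + 1.5 * iqr

# Outliers found by hand and by boxplot.stats()
by_hand <- x[x < lower | x > upper]
by_boxplot <- boxplot.stats(x)$out

print(c(lower = lower, upper = upper))
print(by_hand)
print(by_boxplot)
```

Both approaches flag only the value 40 here.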

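This post covers the following outlier detection topics:

(1) Univariate outlier detection

(2) Outlier detection with LOF (local outlier factor)

(3) Outlier detection by clustering

(4) Outlier detection in time series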

Univariate outlier detection

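This section shows an example of univariate outlier detection and demonstrates how to apply the same approach to multivariate data. In this example, univariate outlier detection is done with boxplot.stats(), which returns the statistics used to draw a box plot. One component of the result is out, which lists the outliers; more precisely, it lists the data points that lie beyond the extremes of the whiskers. The argument coef controls how far the whiskers extend from the box. Run ?boxplot.stats in R for details.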
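The effect of coef can be seen on a small deterministic vector. The sketch below uses made-up data (the names x, out_default, out_tight and out_wide are illustrative only): a smaller coef shortens the whiskers and flags more points, a larger one flags fewer.

```r
# Hypothetical data: 1..20 plus two larger values
x <- c(1:20, 30, 45)

# Default whiskers (coef = 1.5)
out_default <- boxplot.stats(x)$out

# Shorter whiskers: more points fall outside them
out_tight <- boxplot.stats(x, coef = 1)$out

# Longer whiskers: fewer points fall outside them
out_wide <- boxplot.stats(x, coef = 3)$out

print(out_default)
print(out_tight)
print(out_wide)
```

Here the hinges are 6 and 17, so the default fences sit at -10.5 and 33.5 and only 45 is reported; with coef = 1 the value 30 is reported as well, and with coef = 3 nothing is.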

The figure shows a box plot in which the individual points drawn beyond the whiskers are the outliers.

> set.seed(1234)
> x <- rnorm(1000)
> summary(x)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-3.39606 -0.67325 -0.03979 -0.02660  0.61582  3.19590 
> boxplot.stats(x)$out
 [1]  3.043766 -2.732220 -2.855759  2.919140 -3.233152 -2.651741
 [7] -3.396064  3.195901 -2.729680 -2.704203 -2.864347 -2.661346
[13]  2.705775 -2.906674 -2.874042 -2.757050 -2.739754
> library(ggplot2)
> y <- rep(1, 1000)
> z <- data.frame(x, y)
> g <- ggplot(z, aes(y = x, x = y))
> g + geom_boxplot()

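The univariate detection above can also be used to find outliers in multivariate data by simply combining the per-variable results. In the example below, we first create a data frame df with two columns, x and y. Outliers are then detected separately in x and in y, and we take the rows that are outliers in both columns as the outliers of the data.

In the figure below, the outliers are marked with a red "+".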

> y = rnorm(1000)
> df <- data.frame(x,y)
> rm(x,y)
> head(df)
           x          y
1 -1.2070657 -1.2053334
2  0.2774292  0.3014667
3  1.0844412 -1.5391452
4 -2.3456977  0.6353707
5  0.4291247  0.7029518
6  0.5060559 -1.9058829

> attach(df)
> #find the index of outliers from x
> (a<- which(x  %in% boxplot.stats(x)$out))
 [1] 178 181 192 227 237 382 392 486 487 517 558 717 771 788 901 949
[17] 967
> 
> #find the index of outliers from y
> 
> (b <- which(y %in% boxplot.stats(y)$out))
[1] 121 233 317 359 517 660 815
> 

> detach(df)
> #outliers in both x and y
> (outlier.list1 <- intersect(a,b))
 [1] 517 
> 

plot(df)
points(df[outlier.list1,],col="red",cex=2.5,pch="+")

# or with ggplot2
library(ggplot2)
# colour the outliers red and every other point black
z <- ifelse(seq_len(nrow(df)) %in% outlier.list1, "red", "black")
tt <- cbind(df, z)
f <- ggplot(tt, aes(x = x, y = y))
f + geom_point(color = z, alpha = 0.5)

Similarly, we can treat rows in which either x or y is an outlier as outliers. In the figure below, these outliers are marked with a blue "x".

#outliers in either x or y
(outlier.list2<- union(a,b))
plot(df)
points(df[outlier.list2,],col="blue",pch="x",cex=2)

# or with ggplot2
library(ggplot2)
# colour the outliers in either x or y blue and every other point black
z <- ifelse(seq_len(nrow(df)) %in% outlier.list2, "blue", "black")
tt <- cbind(df, z)

f <- ggplot(tt, aes(x = x, y = y))
f + geom_point(color = z, alpha = 0.5)

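When there are three or more variables, the final outliers can be decided by a majority vote over the univariate detection results. Choosing the best way to combine the results in a real application requires domain knowledge.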
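A majority vote over per-column results could look like the sketch below. The helper vote_outliers() and the data frame df3 are hypothetical and not part of the original post; the sketch assumes every column is numeric and flags a row when more than half of the columns mark it as an outlier.

```r
# Hypothetical helper: majority vote over per-column boxplot.stats() results
vote_outliers <- function(data, coef = 1.5) {
  # One logical column per variable: TRUE where the value is a univariate outlier
  flags <- sapply(data, function(col) col %in% boxplot.stats(col, coef = coef)$out)
  # Number of variables that flag each row
  votes <- rowSums(flags)
  # Keep rows flagged by more than half of the variables
  unname(which(votes > ncol(data) / 2))
}

# Toy data (made up): row 21 is extreme in columns a and b, but not in column c
df3 <- data.frame(
  a = c(1:20, 100),
  b = c(20:1, -80),
  c = c(seq(0.5, 10, by = 0.5), 5)
)

voted <- vote_outliers(df3)
print(voted)
```

Row 21 receives two votes out of three and is therefore reported. Whether a simple majority is the right threshold depends on the application.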

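Outlier detection with LOF (local outlier factor)

LOF (local outlier factor) is an algorithm for identifying density-based local outliers. With LOF, the local density of a point is compared with that of its neighbours. If the former is significantly lower than the latter (an LOF value greater than 1), the point lies in a sparser region than its neighbours, which suggests that it is an outlier. A shortcoming of LOF is that it works on numeric data only.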
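To make the density comparison concrete, here is a minimal base-R sketch of the usual LOF definition (k-distance, reachability distance, local reachability density). It is an illustration only, not the implementation used by any package: the function lof_sketch() and the data pts are made up, ties in the k-th neighbour distance are ignored, duplicate points are not handled, and k is assumed to be at least 2.

```r
# Illustrative LOF computation (hypothetical helper, simplified)
lof_sketch <- function(data, k = 3) {
  d <- as.matrix(dist(data))            # pairwise Euclidean distances
  n <- nrow(d)
  stopifnot(k >= 2, k < n)
  diag(d) <- Inf                        # a point is not its own neighbour
  # Row i holds the indices of the k nearest neighbours of point i
  nn <- t(apply(d, 1, function(row) order(row)[1:k]))
  # k-distance: distance from each point to its k-th nearest neighbour
  kdist <- sapply(seq_len(n), function(i) d[i, nn[i, k]])
  # Local reachability density: inverse of the mean reachability distance
  lrd <- sapply(seq_len(n), function(i) {
    reach <- pmax(kdist[nn[i, ]], d[i, nn[i, ]])
    1 / mean(reach)
  })
  # LOF: mean density of the neighbours divided by the point's own density
  sapply(seq_len(n), function(i) mean(lrd[nn[i, ]]) / lrd[i])
}

# Toy data (made up): five points close together and one far away (row 6)
pts <- rbind(
  c(0, 0), c(0, 1), c(1, 0), c(1, 1), c(0.5, 0.5),
  c(5, 5)
)
scores <- lof_sketch(pts, k = 3)
print(round(scores, 2))
```

The isolated point in row 6 gets a score well above 1, while the points inside the cluster stay close to 1.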

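The function lofactor() computes local outlier factors using the LOF algorithm; it is available in the DMwR and dprep packages. Below is an example of outlier detection with LOF, where k is the number of neighbours used to compute the local outlier factors. The figure below shows a density plot of the outlier scores.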

library(DMwR)
#remove"Species",which is a categorical column
iris2 <- iris[,1:4]
outlier.scores <- lofactor(iris2,k=5)
plot(density(outlier.scores))

# or with ggplot2 (requires library(ggplot2))
ggplot(as.data.frame(outlier.scores),aes(x=outlier.scores))+geom_density()

#pick top 5 as outliers
> outliers <- order(outlier.scores,decreasing = T)[1:5]
> #who are outliers
> print(outliers)
[1]  42 107  23 110  63
> 
> print(iris2[outliers,])
    Sepal.Length Sepal.Width Petal.Length Petal.Width
42           4.5         2.3          1.3         0.3
107          4.9         2.5          4.5         1.7
23           4.6         3.6          1.0         0.2
110          7.2         3.6          6.1         2.5
63           6.0         2.2          4.0         1.0
>

Next, we show the outliers on a biplot of the first two principal components.

n <- nrow(iris2)
labels <- 1:n
labels[-outliers] <- "."
biplot(prcomp(iris2),cex = 0.8,xlabs = labels)

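In the code above, prcomp() performs a principal component analysis and biplot() plots the data on the first two principal components. In the figure, the x and y axes are the first and second principal components, the arrows represent the variables, and the five outliers are labelled with their row numbers.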

We can also show the outliers with a pairs plot as follows, where the outliers are marked with a red "+".

pch<- rep(".",n)
pch[outliers]<- "+"
col <- rep("black",n)
col[outliers] <- "red"
pairs(iris2,pch = pch,col = col)

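The Rlof package provides a parallel implementation of the LOF algorithm. Its lof() function is used much like lofactor(), but it has two additional features: it accepts multiple values of k, and it offers several choices of distance measure. An example of lof() follows. After the outlier scores have been computed, outliers can be detected by picking the top few. Note that, at the time of writing, the Rlof package works on Mac OS X and Linux but not on Windows, because it depends on the multicore package for parallel computation.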

library(Rlof)
outlier.scores <- lof(iris2,k=5)
#try with different number of neighbors(k=5,6,7,8,9 and 10)
outlier.scores <- lof(iris2,k=c(5:10))

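Outlier detection by clustering

Another way to detect outliers is clustering. After grouping the data into clusters, the data points that do not belong to any cluster are treated as outliers. For example, with the density-based clustering algorithm DBSCAN, objects are grouped into one cluster if they are connected to one another through a densely populated region. Objects that are not assigned to any cluster are therefore the outliers.

We can also detect outliers with the k-means algorithm. With k-means, the data are partitioned into k groups by assigning each object to the nearest cluster centre. We can then compute the distance (or dissimilarity) between each object and its cluster centre and pick the objects with the largest distances as outliers.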
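The distance-to-centre idea can be sketched on a small synthetic data set before turning to real data. Everything here is made up for illustration (the points in pts, the seed, and the choice of two clusters with nstart = 10); the score is the Euclidean distance from each row to the centre of the cluster it was assigned to.

```r
set.seed(1)  # k-means uses random starting centres

# Toy data (hypothetical): two tight 3 x 3 grids and one stray point in row 19
grid <- as.matrix(expand.grid(x = c(-0.5, 0, 0.5), y = c(-0.5, 0, 0.5)))
pts <- rbind(grid, grid + 10, c(3, 3))

# Cluster into two groups, keeping the best of ten random starts
km <- kmeans(pts, centers = 2, nstart = 10)

# Centre of the cluster each row belongs to, one row per observation
centres <- km$centers[km$cluster, ]

# Euclidean distance from each observation to its own cluster centre
dists <- sqrt(rowSums((pts - centres)^2))

# The observation farthest from its centre is the top outlier candidate
top <- order(dists, decreasing = TRUE)[1]
print(top)
```

The stray point ends up in the cluster of the first grid, pulls that centre only slightly, and has by far the largest distance.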

Below is an example of outlier detection with k-means on the iris data.

#remove species from the data to cluster
> iris2 <- iris[,1:4]
> kmeans.result <- kmeans(iris2,centers=3)
> #cluster centers
> kmeans.result$centers
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1     5.006000    3.428000     1.462000    0.246000
2     5.901613    2.748387     4.393548    1.433871
3     6.850000    3.073684     5.742105    2.071053
> 
> #cluster IDs
> kmeans.result$cluster
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [29] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2
 [57] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2
 [85] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3 3
[113] 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3
[141] 3 3 2 3 3 3 2 3 3 2
>

#calculate distance between objects and cluster centers
> centers <- kmeans.result$centers[kmeans.result$cluster,]
> distances <- sqrt(rowSums((iris2 - centers)^2))
> #pick top 5 largest distances
> outliers <- order(distances,decreasing = T)[1:5]
> #who are outliers
> print(outliers)
[1]  99  58  94  61 119
> print(iris2[outliers,])
    Sepal.Length Sepal.Width Petal.Length Petal.Width
99           5.1         2.5          3.0         1.1
58           4.9         2.4          3.3         1.0
94           5.0         2.3          3.3         1.0
61           5.0         2.0          3.5         1.0
119          7.7         2.6          6.9         2.3

#plot clusters
plot(iris2[,c("Sepal.Length","Sepal.Width")],pch="o",col = kmeans.result$cluster,cex=0.6)

#plot cluster centers
points(kmeans.result$centers[,c("Sepal.Length","Sepal.Width")],col=1:3,pch=8,cex=1.5)

#plot outliers
points(iris2[outliers,c("Sepal.Length","Sepal.Width")],pch = "+",col=4,cex=1.5)

#or plot the clusters with ggplot2 (run this after the base plot above is complete)
library(ggplot2)
tt = cbind(iris2,a = factor(kmeans.result$cluster))
ggplot(tt,aes(x=Sepal.Width,y=Sepal.Length,colour=a))+geom_point()

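In the figure above, the cluster centres are marked with asterisks and the outliers with "+".

Outlier detection in time series

This section presents an example of outlier detection on time series data. In this example, the time series is first decomposed with stl() using robust regression, and the outliers are then identified. For an introduction to STL, see http://cs.wellesley.edu/~cs315/Papers/stl%20statistical%20model.pdf.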

#use robust fitting
> f <- stl(AirPassengers,'periodic',robust=TRUE)
> (outliers <- which(f$weights<1e-8))
 [1]  79  91  92 102 103 104 114 115 116 126 127 128 138 139
[15] 140
>

#set layout
op <- par(mar=c(0,4,0,3),oma = c(5,0,4,0),mfcol=c(4,1))
plot(f,set.pars = NULL)

sts <- f$time.series
#plot outliers
points(time(sts)[outliers],0.8*sts[,"remainder"][outliers],pch = "x",col="red")
par(op)#reset layout

In the figure above, the outliers are marked with a red "x".
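The same weight-based test can be tried on a synthetic series where the position of the outlier is known. The sketch below is illustrative only: the series, the seed and the spike at position 50 are made up, while the threshold 1e-8 is the one used above. With robust = TRUE, stl() gives observations with very large remainders a robustness weight of (near) zero, which is what the test picks up.

```r
set.seed(42)

# Synthetic monthly series (hypothetical): linear trend + yearly cycle + small noise
n <- 120
t_idx <- seq_len(n)
y <- 0.05 * t_idx + sin(2 * pi * t_idx / 12) + rnorm(n, sd = 0.1)

# Inject one large spike at a known position
y[50] <- y[50] + 10
series <- ts(y, frequency = 12)

# Robust STL decomposition; the robustness weights are stored in fit$weights
fit <- stl(series, s.window = "periodic", robust = TRUE)

# Observations whose robustness weight is essentially zero
flagged <- which(fit$weights < 1e-8)
print(flagged)
```

The injected spike at position 50 is among the flagged observations. Other points can be flagged too if their remainder is large relative to the rest of the series.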

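Discussion

The LOF algorithm is good at detecting local outliers, but it works on numeric data only. The Rlof package depends on the multicore package and does not work on Windows. A fast and stable outlier detection strategy for categorical data is the AVF (Attribute Value Frequency) algorithm.

Some R packages for outlier detection:

extremevalues: univariate outlier detection

mvoutlier: multivariate outlier detection based on robust methods

outliers: statistical tests for outliers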

kicilove
