Chapter 1: Classification -- 09. A Simple Classification Simulation

So Cynthia has been talking about the theory of classification,

and specifically she's also been talking about logistic regression.

So in this video I'm going to use some R code

to show you, with a simulation in R, some of the properties of classifiers in general, and of logistic regression in particular.

So let's have a look at some R code. The first thing we're going to do is simulate some data.

We have a label, which we're going to call z, that can take either a true or false logical value,

and two features, which we're going to call x and y, that are generated from random normals.

They're uncorrelated, so it's a very simple case: the centroid of the true cases is at (1, 1), and the centroid of the false cases is at (-1, -1).
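A minimal sketch of that simulation (the video's script isn't reproduced here, so the data frame name, the seed, and the 50/50 split are my assumptions):

    # Simulate 50 true and 50 false cases: two uncorrelated normal
    # features, with centroids at (1, 1) and (-1, -1) respectively.
    set.seed(1234)   # assumed seed, just for reproducibility
    n <- 50
    sim.data <- data.frame(
      x = c(rnorm(n, mean = 1), rnorm(n, mean = -1)),
      y = c(rnorm(n, mean = 1), rnorm(n, mean = -1)),
      z = c(rep(TRUE, n), rep(FALSE, n))
    )
    head(sim.data)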

So let me run that, and you can look at the head of the data frame.

As advertised, we have our two features x and y with various randomly generated values,

and we have our true-or-false label. Let's just look at an x, y plot.

There's no trick here, this is just simple ggplot2,

but I'm going to set the color of each point I plot based on that label z, true or false.
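Something along these lines, assuming the data frame from the sketch above:

    library(ggplot2)
    # Color each point by its true label z
    ggplot(sim.data, aes(x = x, y = y, color = z)) +
      geom_point(size = 2)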

So there you have it: blue-green for true and red for false.

You can see there's some overlap: some red points are clearly getting up into the blue zone, and maybe a few blue-green points are getting into the red area. So no matter what classifier we use, given this overlap in the feature sets, we're never going to get a perfect result. And as you recall, Cynthia talked about the logistic function.

So let me plot that. All I'm doing here is computing it; remember, one way to write it is exp(x) / (1 + exp(x)).

I'm just creating a hundred values of x, computing the hundred corresponding values of y,

and plotting them for you.
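A sketch of that plot (the x range is my guess):

    # The logistic function exp(x) / (1 + exp(x)) over 100 points
    x.vals <- seq(-6, 6, length.out = 100)
    y.vals <- exp(x.vals) / (1 + exp(x.vals))
    plot(x.vals, y.vals, type = "l", xlab = "x", ylab = "logistic(x)")
    abline(h = 0.5, lty = 2)  # default decision threshold of 0.5
    abline(v = 0, lty = 3)    # which the sigmoid crosses at x = 0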

So let me run that, and you see the plot here.

What can you see? You see the sigmoidal logistic function,

and you have this decision threshold at x = 0, which I've marked at a probability of 0.5, right in the middle.

So if the result of the regression is more than 0.5,

we're going to score that as a positive, or a true;

if it's less than 0.5, we're going to score it as a negative, or a false. But keep in mind,

I can move that decision boundary.

I can move it down, in which case I'm going to score more points as true

and fewer points as false; if I move it the other way, I get the opposite effect.

OK, so let's compute a logistic regression model.

I've got this really simple function called logistic.mod, and I'm using R's glm function.

I model z on 0 plus x plus y; the 0 means no intercept,

so the formula is just z ~ 0 + x + y.

And I'm going to use the binomial distribution family, because this is logistic regression:

we just have that true/false binomial output.

And then I can predict my score using the predict method on that model.
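Roughly, assuming the data frame above (the video wraps this in the logistic.mod helper; this sketch inlines it):

    # No intercept (the 0 in the formula), binomial family for the
    # TRUE/FALSE outcome: this is logistic regression.
    mod <- glm(z ~ 0 + x + y, data = sim.data, family = binomial)

    # type = "response" returns the estimated probability that z is TRUE
    sim.data$score <- predict(mod, type = "response")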

Then I'm going to evaluate that model. Cynthia's talked about this:

we decide whether something is, say, a true positive. If z is true and the score is true, it's a true positive;

if z is false but the score is true, it's a false positive, etc.

So it's just simple logic there, and we can count the errors. Then we can plot the points, using color

and now shape to show whether each one is an error dot:

color shows us the true label; shape is going to show us whether we scored it correctly or not.
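A sketch of that plot, using the score computed above:

    # Flag whether each point was scored correctly at the 0.5 threshold
    sim.data$correct <- sim.data$z == (sim.data$score > 0.5)
    # Color = true label, shape = correct vs. error
    ggplot(sim.data, aes(x = x, y = y, color = z, shape = correct)) +
      geom_point(size = 2)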

And we compute the elements of our confusion matrix easily enough,

and the usual statistics of accuracy, precision,

and recall, printing those with the usual formulas. So let me run that code for you.
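The bookkeeping might look like this:

    # Confusion matrix elements at the 0.5 threshold
    pred <- sim.data$score > 0.5
    TP <- sum(sim.data$z & pred)      # true positives
    TN <- sum(!sim.data$z & !pred)    # true negatives
    FP <- sum(!sim.data$z & pred)     # false positives
    FN <- sum(sim.data$z & !pred)     # false negatives

    # The usual formulas
    accuracy  <- (TP + TN) / (TP + TN + FP + FN)
    precision <- TP / (TP + FP)
    recall    <- TP / (TP + FN)
    print(c(accuracy = accuracy, precision = precision, recall = recall))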

We'll see the outcome: we have this confusion matrix,

and you see we're doing pretty well.

We've got 46 true positives and 44 true negatives, with only 6 false positives, where the value was actually negative and we said it was positive, and 4 false negatives, where it was really positive

but we said it was negative. And if we look at our plot here, remember: round points are correct, triangles are errors, red is the false label, and blue-green is the true label.

So you can see the decision boundary of that logistic function has to run somewhere through here. We've got a false positive here,

a point that's actually negative but that we scored as positive,

and we've got some false negatives here, these blue-green triangles. Overall it looks pretty good.

Now we'll move that decision boundary,

so we're going to look at different probabilities on that logistic function.

We've been working with the default of 0.5, but we're going to move to 0.25 and 0.125.

By doing that we're going to get more true positives,

but at the expense of false positives, that is, negative values we're going to score as positive.

And we're also going to make the problem harder:

we're moving the centroids for the true, or positive, values to (0.5, 0.5),

and for the negative values to (-0.5, -0.5).

So we're moving the centroids of the feature sets for the two cases much closer together, making the modeling problem harder. Then we just loop through those probabilities: we compute a model,

and we evaluate that model. Let's look at the results.
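A sketch of that experiment, reusing the pieces above:

    # Harder problem: centroids at (0.5, 0.5) and (-0.5, -0.5)
    hard <- data.frame(
      x = c(rnorm(n, mean = 0.5), rnorm(n, mean = -0.5)),
      y = c(rnorm(n, mean = 0.5), rnorm(n, mean = -0.5)),
      z = c(rep(TRUE, n), rep(FALSE, n))
    )
    mod <- glm(z ~ 0 + x + y, data = hard, family = binomial)
    hard$score <- predict(mod, type = "response")

    # Evaluate at each decision threshold
    for (p in c(0.5, 0.25, 0.125)) {
      pred <- hard$score > p
      TP <- sum(hard$z & pred);   FP <- sum(!hard$z & pred)
      TN <- sum(!hard$z & !pred); FN <- sum(hard$z & !pred)
      cat(sprintf("threshold %.3f: accuracy %.2f, precision %.2f, recall %.2f\n",
                  p, (TP + TN) / length(pred), TP / (TP + FP), TP / (TP + FN)))
    }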

So here's our first case, where we're right in the middle of that logistic sigmoid: we got an accuracy of 74 percent, a precision of 0.75,

and a recall of 0.72. Notice in our confusion matrix we still have mostly true positives

and mostly true negatives,

but we have a fair number of false negatives and false positives.

And you can see what's happened: because we've moved the data closer together,

we've got a lot more triangles here,

the red triangles being false positives and the blue-green triangles being false negatives.

And you can imagine our decision boundary is somewhere here, running right down the middle.

Now let's move it over, and you can see we get a lot more positives, including all the false positives we're picking up here, all those red triangles.

So if we look at the statistics,

our accuracy has dropped to 70 percent and our precision has dropped to basically 0.64,

but recall has jumped up to 0.92.

So we have quite a few more true positives,

but many fewer false negatives. And we'll move that boundary even more radically:

this time, look, our accuracy is down to 0.6, the precision is 0.56,

and the recall is really high, 0.98. We have only one false negative,

but a lot of false positives.

And you can see that in the plot: here's our one false negative,

and all those red triangles

are false positives.

So I hope that's given you some feel for not just logistic regression

and how we do it in R, but also for how classifiers work in general,

and for how things like moving the decision boundary, or playing with other aspects of your classifier, can greatly affect the performance you observe when you're building machine learning models.

