Chapter 1. Classification -- 10. Creating a Classifier with Python (Translation)

So Cynthia has been showing us some theory of classifiers, specifically logistic regression as a kind of baseline classifier. In this video we're going to use a logistic regression classifier and look at some of the properties of logistic regression in particular, but keep in mind that we're talking about principles here that work for almost any classification model. So let's have a look at some Python code and a little simulation I have prepared for you. The first thing we have to do is create a data set, and we're just going to have two features in our data set,

which we're just going to call x and y. We're going to have two possible states for our label, which we're calling z: it can be 1 or 0, so essentially true or false. We generate locations for those two label classes using bivariate normal distributions with no correlation between x and y, so each point just has an x and a y value randomly drawn from these normal distributions. We put each class into a data frame and concatenate the two data frames, so we come up with one data set that has the two different labels and various values for the x and y features. Then we look at the head of that data frame just to get a feel for what it looks like, and as advertised we have our x variable and our y variable, which are our two features, and then z, our label, which can be 1 or 0. Next we plot that data set: a fairly standard Python scatter plot using the pandas plot method, where points whose label z is 1 are drawn in red and points whose label is 0 are drawn in dark blue; otherwise it's all pretty standard.
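Here is a minimal sketch of that simulation; the centroids at (1, 1) and (-1, -1) are the ones the demo mentions later, while the sample size, variance, and seed are assumptions for illustration:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(42)   # assumed seed, for reproducibility
n = 50               # assumed number of points per class

# Positive class (z = 1) centered at (1, 1); negative class (z = 0) at
# (-1, -1); unit variance, no correlation between x and y.
pos = pd.DataFrame({'x': np.random.normal(1.0, 1.0, n),
                    'y': np.random.normal(1.0, 1.0, n),
                    'z': 1})
neg = pd.DataFrame({'x': np.random.normal(-1.0, 1.0, n),
                    'y': np.random.normal(-1.0, 1.0, n),
                    'z': 0})
df = pd.concat([pos, neg], ignore_index=True)
print(df.head())

# Scatter plot: red x markers for z == 1, dark blue dots for z == 0.
ax = df[df.z == 1].plot(kind='scatter', x='x', y='y', color='red', marker='x')
df[df.z == 0].plot(kind='scatter', x='x', y='y', color='darkblue', ax=ax)
plt.show()
```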

Looking at the plot, you can see the red X's are our positive label, or true, and the little blue circles are our negative label, or false. You can see there's some overlap: some red X's get into the area where the population of blue dots is, and likewise some blue dots get up into the area of the red X's. So whatever classifier we use on this, it's unlikely we can get one hundred percent accuracy given just these two features and this overlap between the two labels, and that is very typical of a machine learning problem. Now let's talk about the logistic function. Cynthia has shown you this already, but I'm just going to create a plot of it. It's actually quite simple: I create an x and a y using list comprehensions, where y is the logistic function, which is just the exponential over one plus the exponential of the input.

Then I plot that, and you see you get the expected sigmoidal behavior. The default, the normal way you start with logistic regression, is to say: if the value is greater than 0.5 we call it a 1, a positive, and if the value from the regression is less than 0.5 we call it a minus, zero, or false. But keep in mind we can move this critical point, this decision point, up so that we favor negative calls, or down so that we favor positive calls; just keep that in mind as we go through this demo. Now we'll start from the data set I just showed you, using the default value of one half for the positive-or-negative cutoff, which is a log probability of 1.0, by the way. We're going to use scikit-learn to create a logistic regression, so I have to do a little bit of reshaping: I use the as_matrix method just to make sure I get a NumPy array, which is what scikit-learn needs, and ravel helps me flatten the labels into a one-dimensional array. So I've got my X, which holds my two features, and my y, which is my label. You see I imported linear_model from scikit-learn, so I can call linear_model.LogisticRegression to get the regression object I need, then I fit using that X and y, and now I have a predict method, so I just append the predictions as a new column to my data frame. A sketch of these steps appears below.
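A sketch of those two steps, reusing the df built above; note that as_matrix() has since been deprecated in pandas, so this uses .values instead:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

# The logistic function, exp(x) / (1 + exp(x)), with the default 0.5 cutoff.
xs = np.linspace(-6, 6, 200)
plt.plot(xs, np.exp(xs) / (1.0 + np.exp(xs)))
plt.axhline(0.5, linestyle='--')   # the default decision point
plt.show()

# Fit scikit-learn's logistic regression on the two features.
X = df[['x', 'y']].values            # NumPy array, as scikit-learn expects
y = df['z'].values.ravel()           # labels flattened to a 1-D array
model = linear_model.LogisticRegression()
model.fit(X, y)
df['predicted'] = model.predict(X)   # append the scores as a new column
```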

Then I evaluate. We've talked about how to evaluate any classification model, so basically I'm filling in my confusion matrix here, my true positives, false positives, true negatives, and false negatives, using a series of conditionals: if the predicted value equals 1 and z equals the predicted value, it's a true positive; likewise, if the predicted value equals 0 and z equals the predicted value, it's a true negative; and otherwise it's a false positive or a false negative. So I have these four cases, and I can plot them as scatter plots using different colors and marker sizes

so we can tell the positive from the negative true cases, and whether they're scored correctly or not by our model. Then I compute the counts of true positives, false positives, true negatives, and false negatives for my final confusion matrix, print that confusion matrix, and compute figures like accuracy, precision, and recall, which Cynthia has also discussed; a sketch of that tabulation follows below. So let me run that for you, starting with the confusion matrix. It looks like we're actually pretty accurate: we got 49 true negatives, only one case that was actually negative where we said it was positive, a few truly positive cases that we scored as negative, and otherwise true positives.
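A sketch of that tabulation using boolean conditionals over the appended predicted column, with the standard accuracy, precision, and recall formulas:

```python
# The four confusion-matrix cells, counted from the data frame.
tp = int(((df.predicted == 1) & (df.z == 1)).sum())  # true positives
tn = int(((df.predicted == 0) & (df.z == 0)).sum())  # true negatives
fp = int(((df.predicted == 1) & (df.z == 0)).sum())  # false positives
fn = int(((df.predicted == 0) & (df.z == 1)).sum())  # false negatives

print('tp=%d fp=%d tn=%d fn=%d' % (tp, fp, tn, fn))
print('accuracy  =', (tp + tn) / (tp + tn + fp + fn))
print('precision =', tp / (tp + fp))
print('recall    =', tp / (tp + fn))
```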

So we only have four errors here, which gives us an accuracy of ninety-six percent and really high precision and recall. If we look at the plot, I think it's a little more instructive; the numbers are abstract and general, but here you can see what's happening. Here's our one negative that was misclassified: it's a dot, and it's turned red because it was misclassified. And we've got these three positives which are also misclassified. You can imagine in your mind that the decision boundary has got to be running something like this to get that result. But we can move the decision boundary: remember, on that logistic function I showed you, we can move the decision point up or down.

If we move it, favoring one class over the other, that decision boundary will shift accordingly. So let me do that for you. This code simulates a data set, and we're going to use decision probabilities of 1.0, which is the balanced case we just ran, and then 2 and 4 (these are log probabilities, by the way), so we're shifting the boundary step by step. All right, let me run that. Recalling that logistic function, we can move that threshold, that decision point, up or down.

If we move it up, we're favoring negative calls and avoiding false positives; if we move it down, we're favoring positive calls and avoiding false negatives. In this demo I'm going to first make the problem harder: I've moved the centroids from (1, 1) and (-1, -1) to (0.5, 0.5) and (-0.5, -0.5), so I've basically moved the data for the positive and negative cases closer together and there's more overlap. And we're going to look at log probabilities of 1, 2, and 4,

so as we do that, we're going to favor getting the positives right at the cost of getting some negatives wrong; a rough sketch of this thresholding appears below.
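Here is a rough sketch of the thresholding idea, reusing the model fit earlier; mapping each quoted ratio r to a probability threshold of 1/(1 + r) is my assumption for illustration (r = 1 recovers the balanced 0.5 cutoff), not necessarily the exact computation in the course code:

```python
# Move the decision point by thresholding the predicted probabilities.
probs = model.predict_proba(X)[:, 1]      # P(z = 1) for every point

for r in [1.0, 2.0, 4.0]:                 # the ratios quoted in the demo
    t = 1.0 / (1.0 + r)                   # r = 1 -> 0.50, 2 -> 0.33, 4 -> 0.20
    pred = (probs >= t).astype(int)       # lower threshold favors positives
    tp = int(((pred == 1) & (y == 1)).sum())
    fp = int(((pred == 1) & (y == 0)).sum())
    tn = int(((pred == 0) & (y == 0)).sum())
    fn = int(((pred == 0) & (y == 1)).sum())
    print('r=%.0f t=%.2f: tp=%d fp=%d tn=%d fn=%d acc=%.2f'
          % (r, t, tp, fp, tn, fn, (tp + tn) / len(y)))
```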

Let me run that for you. You see in our first case we mostly get true negatives and true positives; they're in

the majority, but among the truly positive cases we get some false negatives, and among the truly negative cases we get some false positives. Our overall accuracy is about 75 percent, our precision is .75, and recall is in that range too. But as we move that decision boundary, notice that the number of false negatives,

the points that are truly positive but scored as negative, has dropped quite a bit, while the number of negatives incorrectly scored as positive has gone up. So our accuracy has dropped to .7 and our precision has dropped to .66, but our recall has gone up to .84. And likewise, when we move that boundary again, we now have

very few false negatives and a lot more false positives. Our accuracy is not too affected, our precision has gone down again, but our recall is now way up at .92. We can see that graphically: here's our first case, and you can imagine the decision boundary has to go something like this,

and you can see the false negatives are the X's, the false positives are the red circles, the true positives are the blue pluses, and the true negatives are the blue circles. Now I've moved that boundary a little bit, and notice we have a lot more red circles, so a lot more false positives, but we don't have as many red pluses and we have a lot more blue pluses. And when I move that boundary again, that's the last case I showed you, we're down to just four red pluses, so we only have four misclassified positive values, but we have a lot more misclassified negative values. So I hope this little demo has given you a feel not only for logistic regression but for how classification models behave with respect to the overlap of features (we've seen two different cases there), and also for how changing that decision boundary can affect the performance statistics you see from your machine learning model.

