Chapter 1: Classification -- 09. A Simple Classification Simulation (Translation)

So Cynthia has been talking about the theory of classification, and specifically she's also been talking about logistic regression. In this video I'm going to show you, using some R code, a simulation of some of the properties of classifiers in general, but also of logistic regression in particular.

So let's have a look at some R code. The first thing we're going to do is simulate some data. We have a label, which we're going to call z, that can take either a true or a false logical value, and two features, which we're going to call x and y, that are going to be generated from uncorrelated random normals. So it's a very simple case: the centroid of the true cases is going to be (1, 1), and the centroid of the false cases is going to be (-1, -1).
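Here is a minimal sketch of what that simulation might look like; the variable names (sim.data, n), the seed, and the per-class count are my assumptions, not necessarily the course's original code:

```r
# A minimal sketch of the simulation; names and counts are illustrative.
set.seed(1234)
n <- 50  # points per class; the confusion matrix later implies 100 points total
sim.data <- data.frame(
  x = c(rnorm(n, mean = 1), rnorm(n, mean = -1)),  # uncorrelated normals,
  y = c(rnorm(n, mean = 1), rnorm(n, mean = -1)),  # centroids (1,1) and (-1,-1)
  z = c(rep(TRUE, n), rep(FALSE, n))               # the logical label
)
head(sim.data)
```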

So let me run that, and you can look at the head of the data frame. As advertised, we have our two features, x and y, with various randomly generated values, and we have our true-or-false label. Now let's just look at an x-y plot.

There's no trick here, this is just simple ggplot2, but I'm going to set the color of each point I plot based on that label z, true or false.
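A sketch of that plot call, assuming the sim.data frame from the simulation sketch above:

```r
# Scatter plot of the two features, colored by the label z
library(ggplot2)
ggplot(sim.data, aes(x = x, y = y, color = z)) +
  geom_point(size = 3)
```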

So there you have it: blue-green for true and red for false. And you can see there's some overlap: some red points are clearly getting up into the blue zone, and maybe a few blue-green points are getting into the red area. So no matter what classifier we use, given this overlap in the feature sets, we're never going to get a perfect result. And as you recall, Cynthia talked about the logistic function.

So let me plot that. All I'm doing here is computing the function; remember, one way to write it is exp(x) / (1 + exp(x)), the exponential of x over 1 plus the exponential of x. So I'm just creating a hundred values of x, then the hundred corresponding values of y, and plotting them for you.
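A sketch of that logistic-curve plot; the particular range of x values is my assumption:

```r
# The logistic (sigmoid) function: exp(x) / (1 + exp(x))
x <- seq(-5, 5, length.out = 100)  # the range is an assumption
logistic <- data.frame(x = x, y = exp(x) / (1 + exp(x)))
ggplot(logistic, aes(x = x, y = y)) +
  geom_line() +
  geom_hline(yintercept = 0.5, linetype = "dashed")  # default decision threshold
```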

Let me run that, and you see the plot here. So what can you see? You see the sigmoidal logistic function, and you have this decision threshold, which I've put at 0.5, right in the middle, corresponding to x = 0.

So if the result of the regression is more than 0.5, we're going to score that as a positive, or a true; if it's less than 0.5, we're going to score it as a negative, or a false. But keep in mind that I can move that decision boundary. I can move it down, in which case I'm going to score more points as true and fewer points as false; if I move it the other way, I get the opposite effect.
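As a concrete sketch, scoring against a movable threshold might look like this (score.at is a hypothetical helper name, not from the course):

```r
# Score predicted probabilities against a movable decision threshold
score.at <- function(prob, threshold = 0.5) {
  prob > threshold  # TRUE = positive, FALSE = negative
}
# Lowering the threshold, e.g. score.at(prob, threshold = 0.25),
# scores more points as TRUE and fewer as FALSE.
```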

OK, so let's compute a logistic regression model here. I've got this really simple function called logistic.mod, and I'm using the R glm function. In the model formula, the 0 means no intercept, so z is modeled just by x and y. And I'm going to use the binomial distribution family, because this is logistic regression: we just have that true-false binomial output.

Then I can predict my score using the predict method on that model.
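A sketch of that model function; the name logistic.mod comes from the transcript, but the body is my reconstruction, assuming the sim.data frame and the score.at helper from above:

```r
# Logistic regression with no intercept: z ~ 0 + x + y
logistic.mod <- function(df) {
  glm(z ~ 0 + x + y, data = df, family = binomial)
}

mod <- logistic.mod(sim.data)
prob <- predict(mod, type = "response")  # probabilities on the logistic scale
sim.data$score <- score.at(prob)         # default 0.5 threshold
```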

Then I'm going to evaluate that model. So, Cynthia's talked about this:

we decide whether something is, say, a true positive: if z is true and the score is true, it's a true positive; if z is false but the score is true, it's a false positive; and so on. So it's just simple logic, and we can count the errors. Then we can plot the points using color, and now shape, to show whether each one is an error: color shows us the true label, and shape shows us whether we scored it correctly or not. We compute the elements of our confusion matrix easily enough, and we compute the usual statistics of accuracy, precision, and recall, and print those using the usual formulas.
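A sketch of that evaluation step, assuming sim.data now carries both the z label and the thresholded score from the sketches above:

```r
# Tally the confusion matrix and print accuracy, precision, and recall
evaluate <- function(df) {
  TP <- sum(df$z & df$score)    # label true,  scored true
  TN <- sum(!df$z & !df$score)  # label false, scored false
  FP <- sum(!df$z & df$score)   # label false, scored true
  FN <- sum(df$z & !df$score)   # label true,  scored false
  cat("TP =", TP, " TN =", TN, " FP =", FP, " FN =", FN, "\n")
  cat("accuracy  =", (TP + TN) / nrow(df), "\n")
  cat("precision =", TP / (TP + FP), "\n")
  cat("recall    =", TP / (TP + FN), "\n")
}
evaluate(sim.data)
```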

So let me run that code for you and we'll see the outcome. We have this confusion matrix, and you see we're doing pretty well: we've got 46 true positives, 44 true negatives, only 6 false positives (where it was actually a negative value and we said it was positive), and 4 false negatives (where it really was positive but we said it was negative).
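Just to check those numbers against the usual formulas: with TP = 46, TN = 44, FP = 6, and FN = 4, accuracy = (46 + 44) / 100 = 0.90, precision = 46 / (46 + 6) ≈ 0.88, and recall = 46 / (46 + 4) = 0.92.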

And if we look at our plot here, remember: round points are correct, triangles are errors, red is the false label, and blue-green is the true label. So you can see the decision boundary on that logistic function has to run somewhere through here. We've got a red triangle here, which is a false positive, because it's actually a negative that we scored as positive, and we've got some false negatives here, these blue-green triangles. Overall it looks pretty good.

Now we'll move that decision boundary. We're going to look at different probabilities on that logistic function: we've been working with the default of 0.5, but we're going to move to 0.25 and 0.125.

By doing that we're going to get more true positives, but at the expense of false positives, that is, negative values that we score as positive. We're also going to make the problem harder: we're moving the centroid for the true, or positive, values to (0.5, 0.5), and for the negative values to (-0.5, -0.5). So we're moving the centroids of the feature sets for the two cases much closer together, making the modeling problem harder. Then we just loop through those probabilities, compute a model,

and evaluate that model, and we'll just look at the results here.
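A sketch of that experiment, reusing the hypothetical helpers from the earlier sketches:

```r
# Harder problem: centroids moved to (0.5, 0.5) and (-0.5, -0.5)
sim.data2 <- data.frame(
  x = c(rnorm(n, mean = 0.5), rnorm(n, mean = -0.5)),
  y = c(rnorm(n, mean = 0.5), rnorm(n, mean = -0.5)),
  z = c(rep(TRUE, n), rep(FALSE, n))
)

mod2 <- logistic.mod(sim.data2)
prob2 <- predict(mod2, type = "response")

# Loop over the three decision thresholds from the transcript
for (threshold in c(0.5, 0.25, 0.125)) {
  sim.data2$score <- score.at(prob2, threshold)
  cat("\nthreshold =", threshold, "\n")
  evaluate(sim.data2)
}
```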

So here's our first case, where we're right in the middle of that logistic sigmoid: we got an accuracy of seventy-four percent, a precision of 0.75, and a recall of 0.72. Notice in our confusion matrix we still have mostly true positives and mostly true negatives, but we have a fair number of false negatives and false positives.

And you can see what's happened: because we've moved the data closer together, we've got a lot more triangles here, the red triangles being false positives and the blue-green triangles being false negatives. You can imagine our decision boundary is somewhere here, right down the middle.

Now let's move it over, and you can see we get a lot more positives, and all the false positives we're picking up here, with all those red triangles. So if we look at the statistics, our accuracy has dropped to seventy percent and our precision has dropped to basically 0.64, but recall has jumped up to 0.92.

So we have quite a few more true positives and many fewer false negatives. Now we'll move that boundary even more radically, and this time look: our accuracy is down to 0.6 and the precision is 0.56, but the recall is really high, 0.98. We only have one false negative, but a lot of false positives here.

And you can see that in the plot: here's our one false negative, and all those red triangles are false positives.

So I hope that's given you some feel for not just logistic regression and how we do it in R, but also for how classifiers work in general, and for how doing things like moving the decision boundary, or playing with other aspects of your classifier, can greatly affect the performance you observe when you're building machine learning models.

