Accuracy metrics for object detection: F1 & IoU

References:

https://stats.stackexchange.com/questions/273537/f1-dice-score-vs-iou

https://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/

 

Definition:

IoU (Intersection over Union) / Jaccard index:

IoU = TP / (TP + FP + FN)

F1 score / Dice coefficient:

F1 = 2TP / (2TP + FP + FN)

[Figure: illustration of IoU.]
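As a concrete complement to the two definitions above, here is a minimal Python sketch (the helper names iou_from_counts and f1_from_counts are my own, not from either reference) that computes both metrics from raw TP/FP/FN counts. For bounding boxes the same formulas apply with the counts replaced by areas of the intersection and of the non-overlapping regions.

```python
def iou_from_counts(tp: int, fp: int, fn: int) -> float:
    """Jaccard index: TP / (TP + FP + FN)."""
    denom = tp + fp + fn
    return tp / denom if denom else 1.0  # empty ground truth and prediction: treat as perfect


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Dice / F1 score: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


if __name__ == "__main__":
    # Example: 80 true positives, 10 false positives, 10 false negatives.
    print(iou_from_counts(80, 10, 10))  # 0.8
    print(f1_from_counts(80, 10, 10))   # 0.888...
```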

 

More explanation:

From the definition of the two metrics, we have that IoU and F score are always within a factor of 2 of each other:

F/2 ≤ IoU ≤ F

and also that they meet at the extremes of one and zero under the conditions that you would expect (perfect match and completely disjoint).

Note also that the ratio between them can be related explicitly to the IoU:

IoU/F = 1/2 + IoU/2

so that the ratio approaches 1/2 as both metrics approach zero.
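Both relations are easy to verify numerically. The following throwaway Python sketch (my own, not part of the quoted answer) sweeps over small confusion counts and asserts the bound and the ratio identity:

```python
import itertools


def iou(tp, fp, fn):
    return tp / (tp + fp + fn)


def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)


# tp starts at 1 so both scores are non-zero and the ratio IoU/F is defined.
for tp, fp, fn in itertools.product(range(1, 6), range(0, 6), range(0, 6)):
    i, f = iou(tp, fp, fn), f1(tp, fp, fn)
    assert f / 2 - 1e-12 <= i <= f + 1e-12            # F/2 <= IoU <= F
    assert abs(i / f - (0.5 + i / 2)) < 1e-12         # IoU/F = 1/2 + IoU/2
print("both identities hold for all sampled counts")
```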

But there's a stronger statement that can be made for the typical application of classification à la machine learning. For any fixed "ground truth", the two metrics are always positively correlated. That is to say that if classifier A is better than classifier B under one metric, it is also better under the other metric.

It is tempting then to conclude that the two metrics are functionally equivalent, so that the choice between them is arbitrary, but not so fast! The problem comes when taking the average score over a set of inferences. Then the difference emerges when quantifying how much worse classifier B is than A for any given case.

In general, the IoU metric tends to penalize single instances of bad classification more than the F score does, quantitatively, even when both metrics agree that a given instance is bad. Similarly to how L2 can penalize the largest mistakes more than L1, the IoU metric tends to have a "squaring" effect on the errors relative to the F score. So the F score tends to measure something closer to average performance, while the IoU score measures something closer to worst-case performance.
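One way to see this "squaring" effect in numbers: solving the ratio above for IoU gives IoU = F / (2 − F) for a single instance, so a poorly classified instance keeps a much smaller fraction of its F score when converted to IoU than a well classified one. A small sketch of my own tabulating this:

```python
# For a single instance, IoU = F / (2 - F), so the "discount" IoU/F = 1 / (2 - F)
# shrinks as the instance gets worse: bad instances lose relatively more under IoU.
for f in (0.95, 0.80, 0.50, 0.20):
    iou = f / (2 - f)
    print(f"F = {f:.2f}  ->  IoU = {iou:.3f}  (IoU/F = {iou / f:.2f})")
```

Averaging IoU over instances therefore lets the worst cases pull the mean down harder than averaging F does, which is the sense in which IoU behaves more like a worst-case measure.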

Suppose, for example, that the vast majority of the inferences are moderately better with classifier A than with B, but some of them are significantly worse using classifier A. It may then be the case that the F metric favors classifier A while the IoU metric favors classifier B.

To be sure, both of these metrics are much more alike than they are different. But both of them suffer from another disadvantage when taking averages of these scores over many inferences: they both overstate the importance of sets with little to no actual ground-truth positives. In the common example of image segmentation, if an image has only a single pixel of some detectable class, and the classifier detects that pixel and one other pixel, its F score is a lowly 2/3 and the IoU is even worse at 1/2. Trivial mistakes like these can seriously dominate the average score taken over a set of images. In short, each metric weights every pixel error inversely proportionally to the size of the selected/relevant set, rather than treating all errors equally.
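The single-pixel example is easy to check, and a toy average shows how one such image can drag down the mean. The 0.95 per-image F scores below are made-up illustrative values, not data from the post:

```python
def iou(tp, fp, fn):  # TP / (TP + FP + FN)
    return tp / (tp + fp + fn)


def f1(tp, fp, fn):   # 2TP / (2TP + FP + FN)
    return 2 * tp / (2 * tp + fp + fn)


# One ground-truth pixel; the classifier marks that pixel plus one extra: TP=1, FP=1, FN=0.
print(f1(1, 1, 0), iou(1, 1, 0))  # 0.666..., 0.5

# Ten otherwise well-segmented images (F ~ 0.95 each, IoU = F/(2-F)) plus this one trivial mistake:
f_scores = [0.95] * 10 + [f1(1, 1, 0)]
iou_scores = [0.95 / (2 - 0.95)] * 10 + [iou(1, 1, 0)]
print(sum(f_scores) / len(f_scores))      # mean F drops from 0.95 to ~0.92
print(sum(iou_scores) / len(iou_scores))  # mean IoU drops from ~0.90 to ~0.87
```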

There is a far simpler metric that avoids this problem. Simply use the total error: FN + FP (e.g. 5% of the image's pixels were miscategorized). In the case where one error type is more important than the other, a weighted sum may be used: c0·FP + c1·FN.
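A corresponding sketch of this total-error alternative; the weights c0 and c1 are placeholders for whatever relative costs you assign to false positives and false negatives:

```python
def total_error_rate(fp: int, fn: int, n_pixels: int) -> float:
    """Fraction of all pixels that were miscategorized: (FP + FN) / N."""
    return (fp + fn) / n_pixels


def weighted_error(fp: int, fn: int, c0: float = 1.0, c1: float = 1.0) -> float:
    """Weighted total error c0*FP + c1*FN, for when one error type matters more."""
    return c0 * fp + c1 * fn


# Example: a 100x100 image with 30 false-positive and 20 false-negative pixels.
print(total_error_rate(30, 20, 100 * 100))      # 0.005, i.e. 0.5% of pixels miscategorized
print(weighted_error(30, 20, c0=1.0, c1=2.0))   # 70.0 if false negatives cost twice as much
```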
