Machine Learning Yearning, Chapter 11

When to change dev/test sets and metrics 

When starting out on a new project, I try to quickly choose dev/test sets, since this gives the team a well-defined target to aim for. 

I typically ask my teams to come up with an initial dev/test set and an initial metric in less than one week—almost never longer. It is better to come up with something imperfect and get going quickly, rather than overthink this. But this one week timeline does not apply to mature applications. For example, anti-spam is a mature deep learning application. I have seen teams working on already-mature systems spend months to acquire even better dev/test sets. 

If you later realize that your initial dev/test set or metric missed the mark, by all means change them quickly. For example, if your dev set + metric ranks classifier A above classifier B, but your team thinks that classifier B is actually superior for your product, then this might be a sign that you need to change your dev/test sets or your evaluation metric. 

There are three main possible causes of the dev set/metric incorrectly rating classifier A higher: 

1. The actual distribution you need to do well on is different from the dev/test sets.

Suppose your initial dev/test set had mainly pictures of adult cats. You ship your cat app, and find that users are uploading a lot more kitten images than expected. So, the dev/test set distribution is not representative of the actual distribution you need to do well on. In this case, update your dev/test sets to be more representative. 
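For instance, one way to make the dev set more representative is to re-sample it so that its category mix matches what users actually upload. The sketch below is only illustrative; `production_sample`, `candidate_pool`, the category names, and the dev set size are hypothetical placeholders, not anything specified in this chapter.

```python
import random

# Hypothetical sketch: rebuild the dev set so its kitten/adult-cat mix
# matches what users actually upload, rather than the original collection.
# `production_sample` and `candidate_pool` are assumed to be lists of
# (image_path, category) pairs gathered from app logs and labeled data.

def rebuild_dev_set(production_sample, candidate_pool, dev_size=2000, seed=0):
    """Draw a new dev set whose category proportions mirror production traffic."""
    rng = random.Random(seed)

    # Estimate the category mix seen in production (e.g. "kitten" vs "adult_cat").
    counts = {}
    for _, category in production_sample:
        counts[category] = counts.get(category, 0) + 1
    total = sum(counts.values())

    # Group the labeled candidates we could put into the new dev set.
    by_category = {}
    for path, category in candidate_pool:
        by_category.setdefault(category, []).append(path)

    # Fill each category's quota in proportion to its share of production traffic.
    new_dev = []
    for category, count in counts.items():
        quota = round(dev_size * count / total)
        pool = by_category.get(category, [])
        new_dev.extend((p, category) for p in rng.sample(pool, min(quota, len(pool))))

    rng.shuffle(new_dev)
    return new_dev
```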

2. You have overfit to the dev set.

The process of repeatedly evaluating ideas on the dev set causes your algorithm to gradually “overfit” to the dev set. When you are done developing, you will evaluate your system on the test set. If you find that your dev set performance is much better than your test set performance, it is a sign that you have overfit to the dev set. In this case, get a fresh dev set. 
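As a rough illustration of how you might spot this, you could periodically compare dev and test accuracy and treat a large gap as a warning sign. The helper below is only a sketch; `model`, the datasets, and the 2% gap threshold are assumptions for the example, not values prescribed here.

```python
# Illustrative check: compare dev and test accuracy and flag a suspiciously
# large gap as a possible sign of overfitting to the dev set.
# `model.predict`, `dev_x/dev_y`, and `test_x/test_y` are assumed placeholders.

def accuracy(predictions, labels):
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

def check_dev_overfitting(model, dev_x, dev_y, test_x, test_y, max_gap=0.02):
    dev_acc = accuracy(model.predict(dev_x), dev_y)
    test_acc = accuracy(model.predict(test_x), test_y)
    gap = dev_acc - test_acc
    if gap > max_gap:
        print(f"dev={dev_acc:.3f}, test={test_acc:.3f}: gap of {gap:.3f} "
              "suggests overfitting to the dev set; consider a fresh dev set.")
    return dev_acc, test_acc
```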

If you need to track your team’s progress, you can also evaluate your system regularly—say once per week or once per month—on the test set. But do not use the test set to make any decisions regarding the algorithm, including whether to roll back to the previous week’s system. If you do so, you will start to overfit to the test set, and can no longer count on it to give a completely unbiased estimate of your system’s performance (which you would need if you’re publishing research papers, or perhaps using this metric to make important business decisions). 

3. The metric is measuring something other than what the project needs to optimize.

Suppose that for your cat application, your metric is classification accuracy. This metric currently ranks classifier A as superior to classifier B. But suppose you try out both algorithms, and find classifier A is allowing occasional pornographic images to slip through. Even though classifier A is more accurate, the bad impression left by the occasional pornographic image means its performance is unacceptable. What do you do? 

Here, the metric is failing to identify the fact that Algorithm B is in fact better than Algorithm A for your product. So, you can no longer trust the metric to pick the best algorithm. It is time to change evaluation metrics. For example, you can change the metric to heavily penalize letting through pornographic images.  I would strongly recommend picking a new metric and using the new metric to explicitly define a new goal for the team, rather than proceeding for too long without a trusted metric and reverting to manually choosing among classifiers.  
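As one possible way to do this (the exact weighting scheme is up to you), the new metric could be a weighted error in which a mistake on a pornographic image counts far more than a mistake on an ordinary image. The sketch below is illustrative only; the weight of 100 and the `is_porn` flags are assumptions for the example, not values from the text.

```python
# Sketch of one possible replacement metric: a weighted error that counts a
# mistake on a pornographic image far more heavily than an ordinary mistake.

def weighted_error(predictions, labels, is_porn, porn_weight=100.0):
    """predictions/labels: 1 = cat, 0 = not cat; is_porn flags pornographic images."""
    total_weight = 0.0
    error_weight = 0.0
    for pred, label, porn in zip(predictions, labels, is_porn):
        w = porn_weight if porn else 1.0   # illustrative weight, tune to taste
        total_weight += w
        if pred != label:
            error_weight += w
    return error_weight / total_weight

# Under this metric, a classifier that lets a few pornographic images through
# scores much worse than one that is slightly less accurate overall but blocks them.
```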

It is quite common to change dev/test sets or evaluation metrics during a project. Having an initial dev/test set and metric helps you iterate quickly. If you ever find that the dev/test sets or metric are no longer pointing your team in the right direction, it’s not a big deal! Just change them and make sure your team knows about the new direction.  

 

 
