



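  • You have too much data. Consider developing learning curves to estimate how large a representative sample needs to be, or use a big data framework so that you can use all of the available data.
  • You have too little data. First confirm that you really do have too little. Then consider collecting more data, or using data augmentation methods to artificially increase your sample size.
  • You have not collected data yet. Start collecting data and evaluate whether it is enough. If this is for a study, or data collection is expensive, consider talking to a domain expert and a statistician.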

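In my own practical work, I often use learning curves, apply resampling methods such as k-fold cross-validation and the bootstrap on small datasets, and add confidence intervals to final results.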
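As one illustration of adding a confidence interval to a final result, here is a minimal sketch of a percentile bootstrap interval for classification accuracy. The function name `bootstrap_ci`, the number of resampling rounds, and the example outcomes (42 correct out of 50) are all assumptions made for illustration; it is one simple way to do this, not the only one.

```python
import random

def bootstrap_ci(correct, n_rounds=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy.

    `correct` is a list of 0/1 flags, one per test example
    (1 means the model predicted that example correctly).
    """
    rng = random.Random(seed)
    n = len(correct)
    stats = []
    for _ in range(n_rounds):
        # Resample the test outcomes with replacement and re-score.
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    # Take the alpha/2 and 1 - alpha/2 percentiles of the resampled scores.
    lower = stats[int((alpha / 2) * n_rounds)]
    upper = stats[min(n_rounds - 1, int((1 - alpha / 2) * n_rounds))]
    return lower, upper

# Hypothetical result: 42 correct predictions out of 50 test examples.
outcomes = [1] * 42 + [0] * 8
low, high = bootstrap_ci(outcomes)
point = sum(outcomes) / len(outcomes)
print(f"accuracy = {point:.2f}, 95% interval = [{low:.2f}, {high:.2f}]")
```

With only 50 test examples the interval is wide, which is exactly the kind of uncertainty that is worth reporting on a small dataset.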


1. It Depends; There Is No General Answer

No one can tell you how much data you need for your predictive modeling problem without knowing the specifics of your project. It is a thorny question, and one that you will usually have to answer through empirical investigation.


The amount of data required depends on factors such as:

  • The complexity of the problem, nominally the unknown underlying function that best relates your input variables to the output variable.
  • The complexity of the learning algorithm, nominally the algorithm used to inductively learn the unknown underlying mapping function from specific examples.

2. Reason by Analogy with the Experience of Others


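You can also study what others have published about how dataset size affects algorithm performance. You can search for such papers on Google, Google Scholar, and arXiv.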

3. Use Domain Expertise

You need a sample of data from your problem that is representative of the problem you are trying to solve.

In general, the examples must be independent and identically distributed.

Remember, in machine learning we are learning a function to map input data to output data. The mapping function learned will only be as good as the data you provide it from which to learn.

This means that there needs to be enough data to reasonably capture the relationships that may exist both between input features and between input features and output features.

Use your domain knowledge, or find a domain expert and reason about the domain and the scale of data that may be required to reasonably capture the useful complexity in the problem.

4. Use a Statistical Heuristic

There are statistical heuristic methods available that allow you to calculate a suitable sample size.

Most of the heuristics I have seen have been for classification problems as a function of the number of classes, input features or model parameters. Some heuristics seem rigorous, others seem completely ad hoc.

Here are some examples you may consider:

  • Factor of the number of classes: There must be x independent examples for each class, where x could be tens, hundreds, or thousands (e.g. 5, 50, 500, 5000).
  • Factor of the number of input features: There must be x% more examples than there are input features, where x could be tens (e.g. 10).
  • Factor of the number of model parameters: There must be x independent examples for each parameter in the model, where x could be tens (e.g. 10).
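To make the arithmetic behind these three rules of thumb concrete, here is a small sketch. The function names and the example problem (3 classes, 20 input features, a model with 21 parameters) are hypothetical, and the scaling factors are taken straight from the examples above rather than from any rigorous result.

```python
import math

def examples_by_classes(n_classes, per_class):
    # x independent examples for each class.
    return n_classes * per_class

def examples_by_features(n_features, percent_more):
    # x% more examples than there are input features.
    # Integer arithmetic before the division avoids rounding surprises.
    return math.ceil(n_features * (100 + percent_more) / 100)

def examples_by_parameters(n_parameters, per_parameter):
    # x independent examples for each model parameter.
    return n_parameters * per_parameter

# Hypothetical problem: 3 classes, 20 input features, 21 model parameters.
by_classes = examples_by_classes(3, 500)
by_features = examples_by_features(20, 10)
by_parameters = examples_by_parameters(21, 10)
print(by_classes, by_features, by_parameters)
```

The three rules give very different answers for the same problem, which is a reminder of how rough they are.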

They all look like ad hoc scaling factors to me.

Have you used any of these heuristics?
How did it go? Let me know in the comments.

In theoretical work on this topic (not my area of expertise!), a classifier (e.g. k-nearest neighbors) is often contrasted against the optimal Bayesian decision rule, and the difficulty is characterized in the context of the curse of dimensionality; that is, the difficulty of the problem increases exponentially as the number of input features is increased.

For example:

Findings suggest avoiding local methods (like k-nearest neighbors) for sparse samples from high dimensional problems (e.g. few samples and many input features).
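A standard back-of-the-envelope calculation shows why local methods struggle here. Assume (purely for illustration) that the inputs are uniformly distributed in a unit hypercube. A cube-shaped neighborhood that captures a given fraction of the data must then have an edge length of fraction ** (1 / n_features). The function name `edge_length` is my own.

```python
def edge_length(fraction, n_features):
    """Edge length of a sub-cube holding `fraction` of uniformly
    distributed data in a unit hypercube with `n_features` dimensions."""
    return fraction ** (1.0 / n_features)

# How much of each feature's range a "local" neighborhood must span
# in order to capture just 1% of the data.
for p in (1, 2, 10, 100):
    print(f"{p:3d} features -> edge length {edge_length(0.01, p):.3f}")
```

With 10 input features, a neighborhood holding 1% of the data already spans about 63% of the range of every feature, so it is no longer local in any useful sense.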

For a kinder discussion of this topic, see:

5. Nonlinear Algorithms Need More Data

The more powerful machine learning algorithms are often referred to as nonlinear algorithms.

By definition, they are able to learn complex nonlinear relationships between input and output features. You may very well be using these types of algorithms or intend to use them.

These algorithms are often more flexible and even nonparametric (they can figure out how many parameters are required to model your problem in addition to the values of those parameters). They are also high-variance, meaning predictions vary based on the specific data used to train them. This added flexibility and power comes at the cost of requiring more training data, often a lot more data.

In fact, some nonlinear algorithms like deep learning methods can continue to improve in skill as you give them more data.

If a linear algorithm achieves good performance with hundreds of examples per class, you may need thousands of examples per class for a nonlinear algorithm, like random forest, or an artificial neural network.

6. Evaluate Dataset Size vs Model Skill

It is common when developing a new machine learning algorithm to demonstrate and even explain the performance of the algorithm in response to the amount of data or problem complexity.

These studies may or may not be performed and published by the author of the algorithm, and may or may not exist for the algorithms or problem types that you are working with.

I would suggest performing your own study with your available data and a single well-performing algorithm, such as random forest.

Design a study that evaluates model skill versus the size of the training dataset.

Plotting the result as a line plot with training dataset size on the x-axis and model skill on the y-axis will give you an idea of how the size of the data affects the skill of the model on your specific problem.

This graph is called a learning curve.
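A minimal sketch of such a study is shown below. To keep it self-contained it uses only the Python standard library, a synthetic two-class dataset, and a simple nearest-centroid classifier in place of random forest; all of these are assumptions for illustration, and you would substitute your own data and model. It prints the curve as a table rather than plotting it.

```python
import random

def make_data(n, rng):
    """Synthetic two-class problem: two overlapping 2-D Gaussian blobs."""
    data = []
    for _ in range(n):
        label = rng.randrange(2)
        center = 0.0 if label == 0 else 1.5
        data.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    return data

def fit_centroids(train):
    """Nearest-centroid model: the mean point of each class seen in training."""
    sums = {}
    for (x, y), label in train:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {c: (sx / n, sy / n) for c, (sx, sy, n) in sums.items()}

def accuracy(centroids, test):
    correct = 0
    for (x, y), label in test:
        # Predict the class whose centroid is closest to the point.
        pred = min(centroids, key=lambda c: (x - centroids[c][0]) ** 2
                   + (y - centroids[c][1]) ** 2)
        correct += pred == label
    return correct / len(test)

def learning_curve(sizes, n_repeats=20, seed=1):
    rng = random.Random(seed)
    test = make_data(1000, rng)  # fixed held-out test set
    curve = []
    for size in sizes:
        # Average over several fresh training samples of the same size.
        scores = [accuracy(fit_centroids(make_data(size, rng)), test)
                  for _ in range(n_repeats)]
        curve.append((size, sum(scores) / n_repeats))
    return curve

curve = learning_curve([4, 8, 16, 32, 64, 128, 256])
for size, skill in curve:
    print(f"{size:4d} training examples -> mean accuracy {skill:.3f}")
```

Repeating each training size several times and averaging matters, because a single run on a tiny training set is very noisy.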

From this graph, you may be able to project the amount of data that is required to develop a skillful model, or perhaps how little data you actually need before hitting an inflection point of diminishing returns.
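One rough way to make such a projection is to assume that the error rate falls as a power law in the training set size, error ≈ a * n^(-b), fit a and b on a log-log scale, and solve for the size that reaches a target error. The power-law form, the function names, and the measurements below are all assumptions for illustration. The form also ignores any floor on the achievable error, so the projection tends to be optimistic.

```python
import math

def fit_power_law(sizes, errors):
    """Least-squares fit of log(error) = log(a) - b * log(n)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope  # b is the negated slope

def size_for_error(a, b, target_error):
    """Training set size at which the fitted curve reaches target_error."""
    return math.ceil((a / target_error) ** (1.0 / b))

# Hypothetical learning-curve measurements: training size and error rate.
sizes = [100, 200, 400, 800, 1600]
errors = [0.30, 0.24, 0.19, 0.15, 0.12]

a, b = fit_power_law(sizes, errors)
needed = size_for_error(a, b, 0.10)
print(f"fitted error ~ {a:.2f} * n^-{b:.2f}")
print(f"projected training size for 10% error: {needed}")
```

Treat the projected number as a planning estimate only, and re-fit the curve as more data arrives.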

I highly recommend this approach in general in order to develop robust models in the context of a well-rounded understanding of the problem.

7. Naive Guesstimate

You need lots of data when applying machine learning algorithms.

Often, you need more data than you may reasonably require in classical statistics.

I often answer the question of how much data is required with the flippant response:

Get and use as much data as you can.

If pressed with the question, and with zero knowledge of the specifics of your problem, I would say something naive like:

  • You need thousands of examples.
  • No fewer than hundreds.
  • Ideally, tens or hundreds of thousands for “average” modeling problems.
  • Millions or tens-of-millions for “hard” problems like those tackled by deep learning.

Again, this is just more ad hoc guesstimating, but it’s a starting point if you need it. So get started!

8. Get More Data (No Matter What!?)

Big data is often discussed along with machine learning, but you may not require big data to fit your predictive model.

Some problems require big data, all the data you have. For example, simple statistical machine translation:

If you are performing traditional predictive modeling, then there will likely be a point of diminishing returns in the training set size, and you should study your problems and your chosen model/s to see where that point is.

Keep in mind that machine learning is a process of induction. The model can only capture what it has seen. If your training data does not include edge cases, they will very likely not be supported by the model.



Learn something, then take action to better understand what you have with further analysis, extend the data you have with augmentation, or gather more data from your domain.


This section provides more resources on the topic if you are looking to go deeper.

There is a lot of discussion around this question on Q&A sites like Quora, StackOverflow, and CrossValidated. Below are a few choice examples that may help.

I expect that there are some great statistical studies on this question; here are a few I could find.

Other related articles.

If you know of more, please let me know in the comments below.


In this post, you discovered a suite of ways to think and reason about the problem of answering the common question:

How much training data do I need for machine learning?

Did any of these methods help?
Let me know in the comments below.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Except, of course, the question of how much data you specifically need.

