Tricks for Deep Neural Networks: Highlights

This post introduces a broad set of implementation details, i.e., tricks or tips, for building and training your own deep networks. It is organized in eight parts:

1. data augmentation
2. pre-processing of images
3. initialization of networks
4. tips during training
5. selection of activation functions
6. regularization
7. insights found from figures
8. ensembling multiple deep networks

Each part is covered in turn below.

I. Data Augmentation

1. There are many ways to augment data, such as horizontal flipping, random crops and color jittering. Moreover, you could try combinations of multiple different processing steps, e.g., doing rotation and random scaling at the same time. In addition, you can try to raise the saturation and value (S and V components of the HSV color space) of all pixels to a power between 0.25 and 4 (same for all pixels within a patch), multiply these values by a factor between 0.7 and 1.4, and add to them a value between -0.1 and 0.1. Also, you could add a value between -0.1 and 0.1 to the hue (H component of HSV) of all pixels in the image/patch.
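The augmentations described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the image is taken to be an H x W x 3 float array in [0, 1], `jitter_hsv` expects it to be already converted to HSV, and the clipping, the hue wrap-around and the separate random draws for S and V are choices made here, not something the text specifies. The function names are hypothetical.

```python
import numpy as np

def random_flip_and_crop(image, crop_size, rng):
    """Randomly flip an H x W x C image horizontally, then take a random square crop."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]                    # horizontal flip
    h, w, _ = image.shape
    top = rng.integers(0, h - crop_size + 1)         # upper bound is exclusive
    left = rng.integers(0, w - crop_size + 1)
    return image[top:top + crop_size, left:left + crop_size, :]

def jitter_hsv(hsv, rng):
    """Jitter an HSV patch (floats in [0, 1]); one set of random factors per patch."""
    out = hsv.copy()
    for c in (1, 2):                                 # S and V channels
        power = rng.uniform(0.25, 4.0)
        factor = rng.uniform(0.7, 1.4)
        shift = rng.uniform(-0.1, 0.1)
        # Assumption: clip back into [0, 1] after the transformation.
        out[..., c] = np.clip(out[..., c] ** power * factor + shift, 0.0, 1.0)
    # Assumption: hue is circular, so it wraps around instead of being clipped.
    out[..., 0] = (out[..., 0] + rng.uniform(-0.1, 0.1)) % 1.0
    return out

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                      # stand-in for a real HSV image
patch = random_flip_and_crop(image, 28, rng)
jittered = jitter_hsv(patch, rng)
```

Drawing new random factors for every patch and every epoch means the network rarely sees exactly the same input twice.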

2. Krizhevsky et al. [1] proposed fancy PCA. You first perform PCA on the set of RGB pixel values throughout your training images. Then you add the following quantity to each RGB image pixel (i.e., $I_{xy}=[I_{xy}^R,I_{xy}^G,I_{xy}^B]^T$): $[\mathbf{p}_1,\mathbf{p}_2,\mathbf{p}_3][\alpha_1\lambda_1,\alpha_2\lambda_2,\alpha_3\lambda_3]^T$, where $\mathbf{p}_i$ and $\lambda_i$ are the $i$-th eigenvector and eigenvalue of the $3\times 3$ covariance matrix of RGB pixel values, respectively, and $\alpha_i$ is a random variable drawn from a Gaussian with mean zero and standard deviation 0.1.
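A minimal NumPy sketch of fancy PCA, assuming the training images are stacked in a float array of shape (N, H, W, 3). The function names are hypothetical, and drawing the $\alpha_i$ once per image (rather than once per pixel) follows the description in Krizhevsky et al. [1].

```python
import numpy as np

def fancy_pca_fit(train_images):
    """Eigen-decompose the 3 x 3 covariance of all RGB pixels in the training set."""
    pixels = train_images.reshape(-1, 3).astype(np.float64)
    cov = np.cov(pixels, rowvar=False)           # np.cov centers the data itself
    eigvals, eigvecs = np.linalg.eigh(cov)       # columns of eigvecs are the p_i
    return eigvals, eigvecs

def fancy_pca_augment(image, eigvals, eigvecs, rng, std=0.1):
    """Add [p1, p2, p3][a1*l1, a2*l2, a3*l3]^T to every pixel of one image."""
    alphas = rng.normal(0.0, std, size=3)        # one draw per image
    delta = eigvecs @ (alphas * eigvals)         # RGB offset, shape (3,)
    return image + delta                         # broadcast over all pixels
```

Because the same offset is added to every pixel, the augmentation changes the overall color and intensity of the image without altering its spatial content.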

II. Pre-processing

1. The first and simplest pre-processing approach is to zero-center the data and then normalize it.

Code:

>>> import numpy as np
>>> # X is a float array of shape (num_samples, num_features)
>>> X -= np.mean(X, axis = 0) # zero-center
>>> X /= np.std(X, axis = 0) # normalize

2. Another pre-processing approach, similar to the first one, is PCA whitening.

>>> X -= np.mean(X, axis = 0) # zero-center
>>> cov = np.dot(X.T, X) / X.shape[0] # compute the covariance matrix
>>> U,S,V = np.linalg.svd(cov) # SVD of the covariance matrix; S holds its eigenvalues
>>> Xrot = np.dot(X, U) # decorrelate the data
>>> Xwhite = Xrot / np.sqrt(S + 1e-5) # divide by the square roots of the eigenvalues (1e-5 avoids division by zero)

A note on the methods above: in practice, PCA whitening is not used with Convolutional Neural Networks. Zero-centering the data, however, is very important, and it is common to see normalization of every pixel as well.

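III. Initialization

1. All-zero initialization: if all weights are set to zero or to the same value, every neuron computes the same gradient and receives the same parameter update, i.e., nothing breaks the symmetry between neurons.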

In the ideal situation, with proper data normalization it is reasonable to assume that approximately half of the weights will be positive and half of them will be negative. A reasonable-sounding idea then might be to set all the initial weights to zero, which you expect to be the “best guess” in expectation. But, this turns out to be a mistake, because if every neuron in the network computes the same output, then they will also all compute the same gradients during back-propagation and undergo the exact same parameter updates. In other words, there is no source of asymmetry between neurons if their weights are initialized to be the same.
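The symmetry argument can be checked numerically with a tiny two-layer network. This is only a sketch: the network shape, the tanh hidden layer and the squared-error loss are arbitrary choices made for the illustration. When all weights share one value, the gradient columns for the hidden units are identical, so the units can never become different from each other.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))            # 8 samples, 4 input features
y = rng.normal(size=(8, 1))            # regression targets

def hidden_weight_grad(W1, W2):
    """Gradient of a squared-error loss w.r.t. the first-layer weights W1."""
    h = np.tanh(X @ W1)                # hidden activations, shape (8, 3)
    err = h @ W2 - y                   # derivative of 0.5 * (output - y)^2 w.r.t. the output
    dh = (err @ W2.T) * (1.0 - h ** 2) # back-propagate through tanh
    return X.T @ dh                    # shape (4, 3): one column per hidden unit

# Every weight has the same value: all hidden units get the same gradient.
g_same = hidden_weight_grad(np.full((4, 3), 0.5), np.full((3, 1), 0.5))
# Random weights: each hidden unit gets its own gradient.
g_rand = hidden_weight_grad(rng.normal(size=(4, 3)), rng.normal(size=(3, 1)))
```

With exactly zero weights the situation is even worse in this setup, because the second-layer weights are zero and no gradient reaches the first layer at all.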

2. Initialization with small random numbers

You still want the weights to be very close to zero, but not identically zero. To this end, you can initialize the weights of the neurons to small random numbers that are very close to zero, which is referred to as symmetry breaking. The idea is that the neurons are all random and unique in the beginning, so they will compute distinct updates and integrate themselves as diverse parts of the full network. The implementation for weights might simply look like $w \sim 0.001 \times N(0,1)$, where $N(0,1)$ is a zero mean, unit standard deviation Gaussian. It is also possible to use small numbers drawn from a uniform distribution, but this seems to have relatively little impact on the final performance in practice.

3. Calibrating the variances

One problem with the above suggestion is that the distribution of the outputs from a randomly initialized neuron has a variance that grows with the number of inputs. It turns out that you can normalize the variance of each neuron's output to 1 by dividing its weight vector by the square root of its fan-in (i.e., its number of inputs), as follows:

>>> w = np.random.randn(n) / np.sqrt(n) # calibrating the variances with 1/sqrt(n)
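A quick empirical check of this claim, assuming unit-variance Gaussian inputs and a single linear neuron (both are assumptions made for the illustration): without calibration the output variance is on the order of the fan-in, with calibration it is close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500                                    # fan-in of the neuron
x = rng.normal(size=(2000, n))             # 2000 inputs with unit-variance features

w_raw = rng.normal(size=n)                 # plain N(0, 1) weights
w_cal = rng.normal(size=n) / np.sqrt(n)    # weights calibrated with 1/sqrt(n)

var_raw = np.var(x @ w_raw)                # grows with n (on the order of n)
var_cal = np.var(x @ w_cal)                # stays on the order of 1
```

The reason is that the output is a sum of n independent terms, each with variance Var(w) * Var(x), so choosing Var(w) = 1/n makes the sum have unit variance.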

4. Current recommendation

The calibration above does not take ReLUs into account. A more recent paper on this topic by He et al. [4] derives an initialization specifically for ReLUs, reaching the conclusion that the variance of neurons in the network should be $2.0/n$, where $n$ is the number of inputs. The weights are therefore samples from a standard normal distribution multiplied by sqrt(2.0/n):

>>> w = np.random.randn(n) * np.sqrt(2.0/n) # current recommendation

[4] K. He, X. Zhang, S. Ren, and J. Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In ICCV, 2015.
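The effect of the extra factor of 2 shows up when many ReLU layers are stacked. The sketch below pushes random inputs through a stack of fully connected ReLU layers and measures the root-mean-square activation at the end; the depth, width and batch size are arbitrary values chosen for the illustration, and `final_layer_rms` is a hypothetical helper.

```python
import numpy as np

def final_layer_rms(scale_fn, depth=20, width=256, seed=0):
    """RMS activation after `depth` ReLU layers whose weights are randn * scale_fn(fan_in)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(512, width))                       # batch of random inputs
    for _ in range(depth):
        W = rng.normal(size=(width, width)) * scale_fn(width)
        a = np.maximum(0.0, a @ W)                          # linear layer followed by ReLU
    return np.sqrt(np.mean(a ** 2))

rms_calibrated = final_layer_rms(lambda n: 1.0 / np.sqrt(n))   # 1/sqrt(n): signal shrinks
rms_he = final_layer_rms(lambda n: np.sqrt(2.0 / n))           # sqrt(2/n): signal preserved
```

A ReLU zeroes roughly half of its inputs, which halves the second moment of the signal at every layer; the factor of 2 in the variance compensates for exactly that loss.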

IV. Training

1. Filter and pooling sizes

During training, the size of the input images should preferably be divisible by 2 many times, such as 32 (e.g., CIFAR-10), 64, 224 (e.g., commonly used for ImageNet), 384 or 512. Moreover, it is important to employ a small filter (e.g., $3\times 3$) and small strides (e.g., 1) with zero-padding, which not only reduces the number of parameters but also improves the accuracy of the whole deep network. Meanwhile, the special case mentioned above, i.e., $3\times 3$ filters with stride 1 and zero-padding of 1, preserves the spatial size of images/feature maps. For the pooling layers, the commonly used pooling size is $2\times 2$.
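The size bookkeeping behind these recommendations follows the standard formula (W - F + 2P) / S + 1 for input size W, filter size F, padding P and stride S. A small helper (hypothetical name) makes it easy to check a planned architecture:

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one dimension."""
    span = input_size - filter_size + 2 * padding
    if span < 0 or span % stride != 0:
        raise ValueError("filter, stride and padding do not tile the input evenly")
    return span // stride + 1

same_size = conv_output_size(32, filter_size=3, stride=1, padding=1)   # 3x3 conv keeps 32
halved = conv_output_size(32, filter_size=2, stride=2, padding=0)      # 2x2 pooling gives 16
```

This also shows why input sizes that are divisible by 2 many times are convenient: each 2x2 pooling layer with stride 2 halves the size without leaving a remainder.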

2. Learning rate

As described in a blog post by Ilya Sutskever [2], it is recommended to divide the gradients by the mini-batch size. This way, you do not necessarily have to change the learning rate (LR) when you change the mini-batch size. To obtain an appropriate LR, using the validation set is an effective way. A typical value of the LR at the beginning of training is 0.1. In practice, if you see that you have stopped making progress on the validation set, divide the LR by 2 (or by 5) and keep going, which might give you a pleasant surprise.
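The "divide when validation stalls" rule can be sketched as a small function. The `patience` and `min_delta` parameters are assumptions introduced here to define what "stopped making progress" means; the text itself only gives the divisor (2 or 5).

```python
def step_learning_rate(lr, val_history, patience=3, factor=2.0, min_delta=1e-4):
    """Return lr / factor if validation accuracy has not improved in the last `patience` epochs."""
    if len(val_history) <= patience:
        return lr                                   # not enough history to judge yet
    best_before = max(val_history[:-patience])      # best accuracy before the recent window
    recent_best = max(val_history[-patience:])      # best accuracy inside the recent window
    if recent_best < best_before + min_delta:
        return lr / factor                          # plateau: reduce the learning rate
    return lr

stalled = step_learning_rate(0.1, [0.5, 0.6, 0.7, 0.7, 0.7, 0.7])       # plateau detected
improving = step_learning_rate(0.1, [0.5, 0.6, 0.7, 0.75, 0.8, 0.85])   # still improving
```

In a training loop, a caller would typically clear or truncate the history after a reduction so that the LR is not divided again on every following epoch.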

3. Fine-tuning pre-trained models

Many released models can be used directly, e.g., those from the Caffe Model Zoo and the VGG Group. To use them on a new data set you need to fine-tune them.

To further improve the classification performance on your data set, a very simple yet effective approach is to fine-tune the pre-trained models on your own data. As shown in the following table, the two most important factors are the size of the new data set (small or big) and its similarity to the original data set. Different fine-tuning strategies can be used in different situations. For instance, a good case (though rare in practice) is that your new data set is very similar to the data used for training the pre-trained models. In that case, if you have very little data, you can just train a linear classifier on the features extracted from the top layers of the pre-trained models.

If you have quite a lot of data at hand, fine-tune a few top layers of the pre-trained models with a small learning rate.

However, if your own data set is quite different from the data used for the pre-trained models but has enough training images, a large number of layers should be fine-tuned on your data, also with a small learning rate, to improve performance.

Finally, if your data set not only contains little data but is also very different from the data used for the pre-trained models, you will be in trouble. Since the data is limited, it seems better to only train a linear classifier. Since the data set is very different, it might not be best to train the classifier on the top of the network, which contains more dataset-specific features. Instead, it might work better to train an SVM classifier on activations/features from somewhere earlier in the network.

V. Selection of Activation Functions

One of the crucial factors in deep networks is the activation function, which brings non-linearity into networks. Here we introduce the details and characteristics of some popular activation functions and give advice later in this section.

Figure source: http://cs231n.stanford.edu/index.html

Several activation functions:

Sigmoid:

The sigmoid non-linearity has the mathematical form $\sigma(x)=1/(1+e^{-x})$. It takes a real-valued number and “squashes” it into the range between 0 and 1. In particular, large negative numbers become 0 and large positive numbers become 1. The sigmoid function has seen frequent use historically since it has a nice interpretation as the firing rate of a neuron: from not firing at all (0) to fully-saturated firing at an assumed maximum frequency (1).

The sigmoid has fallen out of favor because of two major drawbacks:

(1) Sigmoids saturate and kill gradients.

When the neuron's activation saturates at either tail of 0 or 1, the gradient in these regions is almost zero: as the figure shows, the curve is flat at both ends. During back-propagation, this (local) gradient is multiplied by the gradient of this gate's output with respect to the whole objective. Therefore, if the local gradient is very small, it effectively “kills” the gradient, and almost no signal flows through the neuron to its weights and recursively to its data. Additionally, one must pay extra attention when initializing the weights of sigmoid neurons to prevent saturation. For example, if the initial weights are too large, then most neurons become saturated and the network will barely learn.
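The saturation effect is easy to see numerically. The local gradient of the sigmoid is $\sigma(x)(1-\sigma(x))$, which peaks at 0.25 for $x=0$ and vanishes at both tails; the sample points below are chosen only for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Local gradient d(sigmoid)/dx = sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, 0.0, 10.0])
local_grad = sigmoid_grad(x)    # largest at x = 0, nearly zero at both tails
```

Since the upstream gradient is multiplied by this value at every sigmoid layer, even the best case (0.25) shrinks the signal, and saturated units pass on almost nothing.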

(2) Sigmoid outputs are not zero-centered.

This is undesirable since neurons in later layers of processing in a Neural Network would be receiving data that is not zero-centered. This has implications for the dynamics during gradient descent, because if the data coming into a neuron is always positive (e.g., $x>0$ element-wise in $f=w^Tx+b$), then the gradient on the weights $w$ will during back-propagation become either all positive or all negative (depending on the gradient of the whole expression $f$).

This could introduce undesirable zig-zagging dynamics in the gradient updates for the weights. However, notice that once these gradients are added up across a batch of data, the final update for the weights can have variable signs, somewhat mitigating this issue. Therefore, this is an inconvenience, but it has less severe consequences compared to the saturated activation problem above.

Tanh(x)

The tanh non-linearity squashes a real-valued number to the range [-1, 1]. Like the sigmoid neuron, its activations saturate, but unlike the sigmoid neuron its output is zero-centered. Therefore, in practice the tanh non-linearity is always preferred to the sigmoid non-linearity.

Rectified Linear Unit

The Rectified Linear Unit (ReLU) has become very popular in the last few years. It computes the function $f(x)=\max(0,x)$, which is simply thresholded at zero.

There are several pros and cons to using ReLUs:

(Pros) Compared to sigmoid/tanh neurons that involve expensive operations (exponentials, etc.), the ReLU can be implemented by simply thresholding a matrix of activations at zero. Meanwhile, ReLUs do not suffer from saturation.

(Pros) It was found to greatly accelerate (e.g., by a factor of 6 in [1]) the convergence of stochastic gradient descent compared to the sigmoid/tanh functions. It is argued that this is due to its linear, non-saturating form.

(Cons) Unfortunately, ReLU units can be fragile during training and can “die”. For example, a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron will never activate on any datapoint again. If this happens, then the gradient flowing through the unit will forever be zero from that point on. That is, ReLU units can irreversibly die during training since they can get knocked off the data manifold. For example, you may find that as much as 40% of your network can be “dead” (i.e., neurons that never activate across the entire training dataset) if the learning rate is set too high. With a proper setting of the learning rate this is less frequently an issue.

Leaky ReLUs are one attempt to fix the “dying ReLU” problem. Instead of the function being zero when $x<0$, a leaky ReLU will instead have a small slope in the negative region (of 0.01, or so). That is, the function computes $f(x)=\alpha x$ if $x<0$ and $f(x)=x$ if $x\geq 0$, where $\alpha$ is a small constant. Some people report success with this form of activation function, but the results are not always consistent.

A series of ReLU variants followed: ReLU, Leaky ReLU, PReLU and RReLU. In these figures, for PReLU, $\alpha_i$ is learned, and for Leaky ReLU, $\alpha_i$ is fixed. For RReLU, $\alpha_{ji}$ is a random variable that keeps being sampled in a given range during training and remains fixed in testing.
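The variants differ only in how the slope for negative inputs is chosen, which a short sketch makes concrete. The RReLU sampling range used below is an illustrative assumption, not a value taken from the text, and using the midpoint of the range at test time is one common way to keep the slope fixed.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Fixed small slope alpha for x < 0. PReLU has the same form, but alpha is a
    parameter learned by back-propagation instead of a constant."""
    return np.where(x >= 0, x, alpha * x)

def rrelu(x, rng=None, lower=1 / 8, upper=1 / 3):
    """Randomized leaky ReLU. Training (rng given): one random slope per element.
    Testing (rng is None): a fixed slope, here the midpoint of the range."""
    if rng is None:
        alpha = (lower + upper) / 2.0
    else:
        alpha = rng.uniform(lower, upper, size=np.shape(x))
    return np.where(x >= 0, x, alpha * x)
```

For example, `leaky_relu(np.array([-2.0, 3.0]))` scales the negative entry by 0.01 and passes the positive entry through unchanged.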

[5] B. Xu, N. Wang, T. Chen, and M. Li. Empirical Evaluation of Rectified Activations in Convolution Network. In ICML Deep Learning Workshop, 2015.

The paper [5] reports the performance of each activation function:

From these tables, we can find that the performance of ReLU is not the best on any of the three data sets. For Leaky ReLU, a larger slope achieves better accuracy. PReLU easily overfits on small data sets (its training error is the smallest, while its testing error is not satisfactory), but it still outperforms ReLU. In addition, RReLU is significantly better than the other activation functions on NDSB, which shows that RReLU can overcome overfitting, because this data set has less training data than CIFAR-10/CIFAR-100. In conclusion, the three ReLU variants all consistently outperform the original ReLU on these three data sets, and PReLU and RReLU seem to be better choices. Moreover, He et al. also reported similar conclusions in [4].

[4] K. He, X. Zhang, S. Ren, and J. Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In ICCV, 2015.

VI. Regularization

There are several ways of controlling the capacity of Neural Networks to prevent overfitting:

L2 regularization is perhaps the most common form of regularization. It can be implemented by penalizing the squared magnitude of all parameters directly in the objective. That is, for every weight $w$ in the network, we add the term $\frac{1}{2}\lambda w^2$ to the objective, where $\lambda$ is the regularization strength. It is common to see the factor of $\frac{1}{2}$ in front because then the gradient of this term with respect to the parameter $w$ is simply $\lambda w$ instead of $2\lambda w$. The L2 regularization has the intuitive interpretation of heavily penalizing peaky weight vectors and preferring diffuse weight vectors.
L1 regularization is another relatively common form of regularization, where for each weight $w$ we add the term $\lambda|w|$ to the objective. It is possible to combine the L1 regularization with the L2 regularization: $\lambda_1|w|+\lambda_2 w^2$ (this is called Elastic net regularization). The L1 regularization has the intriguing property that it leads the weight vectors to become sparse during optimization (i.e., very close to exactly zero). In other words, neurons with L1 regularization end up using only a sparse subset of their most important inputs and become nearly invariant to the “noisy” inputs. In comparison, final weight vectors from L2 regularization are usually diffuse, small numbers. In practice, if you are not concerned with explicit feature selection, L2 regularization can be expected to give superior performance over L1.
Max norm constraints. Another form of regularization is to enforce an absolute upper bound on the magnitude of the weight vector for every neuron and use projected gradient descent to enforce the constraint. In practice, this corresponds to performing the parameter update as normal, and then enforcing the constraint by clamping the weight vector $\vec{w}$ of every neuron to satisfy $\parallel \vec{w} \parallel_2 < c$. Typical values of $c$ are on the order of 3 or 4. Some people report improvements when using this form of regularization. One of its appealing properties is that the network cannot “explode” even when the learning rates are set too high, because the updates are always bounded.
Dropout is an extremely effective, simple and recently introduced regularization technique by Srivastava et al. [6] that complements the other methods (L1, L2, maxnorm). During training, dropout can be interpreted as sampling a Neural Network within the full Neural Network, and only updating the parameters of the sampled network based on the input data. (However, the exponential number of possible sampled networks are not independent because they share the parameters.) During testing there is no dropout applied, with the interpretation of evaluating an averaged prediction across the exponentially-sized ensemble of all sub-networks (more about ensembles in the next section). In practice, a dropout ratio of $p=0.5$ is a reasonable default, but this can be tuned on validation data.
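The penalty terms and the max-norm projection translate almost directly into NumPy. This is a sketch with hypothetical function names; it assumes each row of the weight matrix holds the incoming weights of one neuron, and it enforces a norm of at most $c$ rather than strictly below $c$.

```python
import numpy as np

def l2_penalty(w, lam):
    """Value and gradient of 0.5 * lam * sum(w^2)."""
    return 0.5 * lam * np.sum(w ** 2), lam * w

def l1_penalty(w, lam):
    """Value and subgradient of lam * sum(|w|)."""
    return lam * np.sum(np.abs(w)), lam * np.sign(w)

def clip_max_norm(W, c=3.0):
    """After a normal update, rescale every row of W whose L2 norm exceeds c."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))   # rows within the bound keep scale 1
    return W * scale
```

Elastic net regularization is simply the sum of an L1 and an L2 term, each with its own strength; the penalty gradients are added to the data gradient before the parameter update, while `clip_max_norm` is applied after it.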

[6] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR, 15(Jun):1929−1958, 2014.
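A minimal sketch of dropout on a layer's activations. Note that this uses the "inverted dropout" variant, which rescales the kept activations during training so that nothing has to change at test time; that is an implementation choice made here, not necessarily the formulation used in [6]. The name `p_keep` is introduced for the illustration.

```python
import numpy as np

def dropout_forward(a, p_keep=0.5, train=True, rng=None):
    """Inverted dropout: keep each activation with probability p_keep during training."""
    if not train:
        return a                                        # no dropout at test time
    rng = np.random.default_rng() if rng is None else rng
    # Dividing by p_keep keeps the expected activation equal to the test-time value.
    mask = (rng.random(a.shape) < p_keep) / p_keep
    return a * mask
```

Each training call samples a different mask, which corresponds to training a different sub-network that shares its parameters with all the others.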

VII. Insights from Figures

As we know, training is very sensitive to the learning rate, so it must be chosen with care. From Fig. 1 in the following, a very high learning rate will cause a quite strange loss curve. A low learning rate will make your training loss decrease very slowly even after a large number of epochs. In contrast, a high learning rate will make the training loss decrease fast at the beginning, but it will also drop into a local minimum. Thus, your network might not achieve satisfactory results in that case. For a good learning rate, shown as the red line in Fig. 1, the loss curve is smooth and finally achieves the best performance.

Now let’s zoom in on the loss curve. An epoch is one pass over the training data, so there are multiple mini-batches in each epoch. If we draw the classification loss for every training batch, the curve looks like Fig. 2, where the vertical axis is the loss, the horizontal axis is the epoch, and the vertical blue band within each epoch shows the losses of the individual mini-batches in that epoch. Similar to Fig. 1, if the trend of the loss curve looks too linear, that indicates your learning rate is low; if it does not decrease much, it tells you that the learning rate might be too high. Moreover, the “width” of the curve is related to the batch size. If the “width” looks too wide, that is to say, the variance between batches is too large, you should increase the batch size.

Another tip comes from the accuracy curve. As shown in Fig. 3, the red line is the training accuracy and the green line is the validation accuracy. When the validation accuracy converges, the gap between the red line and the green one shows the effectiveness of your deep network. If the gap is big, your network gets good accuracy on the training data but only achieves low accuracy on the validation set. In other words, your deep model overfits the training set, and you should increase the regularization strength of the network. However, no gap together with a low accuracy level is not a good thing either, since it shows that your deep model has low learnability. In that case, it is better to increase the model capacity for better results.

VIII. Ensemble

In machine learning, ensemble methods [8] that train multiple learners and then combine them for use are a kind of state-of-the-art learning approach. It is well known that an ensemble is usually significantly more accurate than a single learner, and ensemble methods have already achieved great success in many real-world tasks. In practical applications, especially challenges or competitions, almost all the first-place and second-place winners used ensemble methods.

Here we introduce several ensemble techniques for the deep learning scenario.

Same model, different initialization. Use cross-validation to determine the best hyperparameters, then train multiple models with the best set of hyperparameters but with different random initialization. The danger with this approach is that the variety is only due to initialization.
Top models discovered during cross-validation. Use cross-validation to determine the best hyperparameters, then pick the top few (e.g., 10) models to form the ensemble. This improves the variety of the ensemble but has the danger of including suboptimal models. In practice, this can be easier to perform since it does not require additional retraining of models after cross-validation. Actually, you could directly select several state-of-the-art deep models from Caffe Model Zoo to perform ensemble.
Different checkpoints of a single model. If training is very expensive, some people have had limited success in taking different checkpoints of a single network over time (for example, after every epoch) and using those to form an ensemble. Clearly, this suffers from some lack of variety, but it can still work reasonably well in practice. The advantage of this approach is that it is very cheap.
Some practical examples. If your vision tasks are related to high-level image semantics, e.g., event recognition from still images, a better ensemble method is to employ multiple deep models trained on different data sources to extract different and complementary deep representations. For example, in the Cultural Event Recognition challenge held in association with ICCV’15, we utilized five different deep models trained on images of ImageNet, Place Database and the cultural images supplied by the competition organizers. After that, we extracted five complementary deep features and treated them as multi-view data. Combining the “early fusion” and “late fusion” strategies described in [7], we achieved one of the best performances and ranked 2nd in that challenge. Similar to our work, [9] presented the Stacked NN framework to fuse more deep networks at the same time.
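Whichever way the ensemble members are obtained, a simple way to combine them is to average their predicted class probabilities. The sketch below shows that generic averaging scheme only; it is not the fusion method of [7] or [9], and the optional per-model weights are an assumption added for illustration.

```python
import numpy as np

def ensemble_predict(prob_list, weights=None):
    """Average class probabilities from several models and return labels and averages.

    prob_list: list of arrays of shape (num_samples, num_classes), one per model.
    weights:   optional per-model weights, e.g., derived from validation accuracy.
    """
    probs = np.stack(prob_list)                          # (models, samples, classes)
    avg = np.average(probs, axis=0, weights=weights)     # weighted mean over models
    return avg.argmax(axis=1), avg

p1 = np.array([[0.6, 0.4], [0.2, 0.8]])                  # predictions of model 1
p2 = np.array([[0.3, 0.7], [0.4, 0.6]])                  # predictions of model 2
labels, avg = ensemble_predict([p1, p2])
weighted_labels, _ = ensemble_predict([p1, p2], weights=[3, 1])
```

The same function covers all three recipes above, because models with different initializations, different hyperparameters, or different checkpoints all produce probability arrays of the same shape.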

IX. Miscellaneous

In real-world applications, the data is usually class-imbalanced: some classes have a large number of images/training instances, while others have a very limited number of images.

As discussed in a recent technical report [10], when deep CNNs are trained on such imbalanced training sets, the results show that imbalanced training data can have a severely negative impact on the overall performance of deep networks.

For this issue, the simplest method is to balance the training data by directly up-sampling and down-sampling the imbalanced data, as shown in [10].
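Up-sampling can be sketched as drawing extra indices, with replacement, from every minority class until all classes are as large as the biggest one. The function name is hypothetical, and this is one straightforward way to do it rather than the exact procedure of [10].

```python
import numpy as np

def upsample_to_balance(labels, rng):
    """Return sample indices in which every class appears as often as the largest class."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    picked = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(labels == cls)               # all samples of this class
        extra = rng.choice(idx, size=target - count, replace=True)
        picked.append(np.concatenate([idx, extra]))       # originals plus repeated samples
    return np.concatenate(picked)
```

The returned indices can be used to index both the images and the labels; shuffling them before forming mini-batches avoids feeding the classes in order.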

Another interesting solution is a special kind of crop processing used in our challenge solution [7]. Because the original cultural event images are imbalanced, we extract crops only from the classes that have a small number of training images, which on the one hand supplies diverse data sources and on the other hand alleviates the class-imbalance problem.

[7] X.-S. Wei, B.-B. Gao, and J. Wu. Deep Spatial Pyramid Ensemble for Cultural Event Recognition. In ICCV ChaLearn Looking at People Workshop, 2015.

In addition, you can adjust the fine-tuning strategy to overcome class imbalance. For example, you can divide your own data set into two parts: one contains the classes that have a large number of training samples (images/crops); the other contains the classes with a limited number of samples. Within each part, the class-imbalance problem will not be very serious. When fine-tuning on your data set, first fine-tune on the classes that have a large number of training samples (images/crops), and then continue to fine-tune on the classes with a limited number of samples.

[10] P. Hensman, and D. Masko. The Impact of Imbalanced Training Data for Convolutional Neural Networks. Degree Project in Computer Science, DD143X, 2015.

References & Source Links

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012.
[2] A Brief Overview of Deep Learning, a guest post by Ilya Sutskever.
[3] CS231n: Convolutional Neural Networks for Visual Recognition of Stanford University, held by Prof. Fei-Fei Li and Andrej Karpathy.
[4] K. He, X. Zhang, S. Ren, and J. Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In ICCV, 2015.
[5] B. Xu, N. Wang, T. Chen, and M. Li. Empirical Evaluation of Rectified Activations in Convolution Network. In ICML Deep Learning Workshop, 2015.
[6] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR, 15(Jun):1929−1958, 2014.
[7] X.-S. Wei, B.-B. Gao, and J. Wu. Deep Spatial Pyramid Ensemble for Cultural Event Recognition. In ICCV ChaLearn Looking at People Workshop, 2015.
[8] Z.-H. Zhou. Ensemble Methods: Foundations and Algorithms. Boca Raton, FL: Chapman & Hall/CRC, 2012. (ISBN 978-1-439-830031)
[9] M. Mohammadi, and S. Das. S-NN: Stacked Neural Networks. Project in Stanford CS231n Winter Quarter, 2015.
[10] P. Hensman, and D. Masko. The Impact of Imbalanced Training Data for Convolutional Neural Networks. Degree Project in Computer Science, DD143X, 2015.
