A few common pitfalls in machine learning -- for example, what happens when the variables in a logistic regression are linearly correlated

The list of mistakes below, drawn from real-world applications of machine learning algorithms, cleared up a lot of my own confusion; I recommend reading it.

Machine Learning Done Wrong

Statistical modeling is a lot like engineering.

In engineering, there are various ways to build a key-value store, and each design makes a different set of assumptions about the usage pattern. In statistical modeling, there are various algorithms to build a classifier, and each algorithm makes a different set of assumptions about the data.

When dealing with small amounts of data, it’s reasonable to try as many algorithms as possible and to pick the best one since the cost of experimentation is low. But as we hit “big data”, it pays off to analyze the data upfront and then design the modeling pipeline (pre-processing, modeling, optimization algorithm, evaluation, productionization) accordingly.

As pointed out in my previous post, there are dozens of ways to solve a given modeling problem. Each model makes different assumptions, and it’s not obvious how to navigate them and identify which assumptions are reasonable. In industry, most practitioners pick the modeling algorithm they are most familiar with rather than the one that best suits the data. In this post, I would like to share some common mistakes (the don'ts). I’ll save the best practices (the dos) for a future post.

1. Take the default loss function for granted

Many practitioners train and pick the best model using the default loss function (e.g., squared error). In practice, an off-the-shelf loss function rarely aligns with the business objective. Take fraud detection as an example. When trying to detect fraudulent transactions, the business objective is to minimize the fraud loss. The off-the-shelf loss function of a binary classifier weighs false positives and false negatives equally. To align with the business objective, the loss function should not only penalize false negatives more than false positives, but also penalize each false negative in proportion to the dollar amount at stake. Also, data sets in fraud detection usually contain highly imbalanced labels; in these cases, bias the loss function in favor of the rare class (e.g., through up/down sampling or class weighting).
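As one illustration (a minimal sketch, not from the original post, assuming scikit-learn and synthetic data with hypothetical feature names): up-weight the rare fraud class and additionally weight each fraudulent example by its dollar amount, so that missing a large fraudulent transaction costs more than missing a small one.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
amount = rng.lognormal(mean=3.0, sigma=1.0, size=n)   # transaction amount in dollars (synthetic)
X = np.column_stack([amount, rng.normal(size=n)])     # toy feature matrix
y = rng.binomial(1, 0.02, size=n)                     # ~2% fraud: highly imbalanced labels

# Penalize missing a fraud (false negative) in proportion to the dollar amount
# at risk, and re-balance the rare positive class via class_weight.
sample_weight = np.where(y == 1, amount, 1.0)

clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, y, sample_weight=sample_weight)
```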

2. Use plain linear models for non-linear interactions

When building a binary classifier, many practitioners immediately jump to logistic regression because it’s simple. But many also forget that logistic regression is a linear model, so non-linear interactions among predictors need to be encoded manually. Returning to fraud detection, high-order interaction features like "billing address = shipping address and transaction amount < $50" are required for good model performance. So one should prefer non-linear models such as an SVM with a kernel or tree-based classifiers, which bake in higher-order interaction features.
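A minimal sketch of the two options, assuming scikit-learn and synthetic data (the column names are hypothetical): hand-encode the interaction for a linear model, or let a tree ensemble discover it on its own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
n = 2000
same_address = rng.integers(0, 2, size=n)          # billing address == shipping address
amount = rng.uniform(1, 500, size=n)               # transaction amount in dollars
# Synthetic ground truth that depends only on the *interaction* of the two features.
y = ((same_address == 1) & (amount < 50)).astype(int)

X_raw = np.column_stack([same_address, amount])

# Option 1: logistic regression with the interaction encoded by hand.
interaction = (same_address * (amount < 50)).reshape(-1, 1)
X_manual = np.hstack([X_raw, interaction])
LogisticRegression(max_iter=1000).fit(X_manual, y)

# Option 2: a tree-based model can pick up the interaction without manual encoding.
GradientBoostingClassifier().fit(X_raw, y)
```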

3. Forget about outliers

Outliers are interesting. Depending on the context, they either deserve special attention or should be completely ignored. Take the example of revenue forecasting. If unusual spikes of revenue are observed, it's probably a good idea to pay extra attention to them and figure out what caused the spike. But if the outliers are due to mechanical error, measurement error or anything else that’s not generalizable, it’s a good idea to filter out these outliers before feeding the data to the modeling algorithm.

Some models are more sensitive to outliers than others. For instance, AdaBoost might treat outliers as "hard" cases and put tremendous weight on them, while a decision tree might simply count each outlier as one misclassification. If the data set contains a fair number of outliers, it's important either to use a modeling algorithm that is robust to outliers or to filter the outliers out.
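For concreteness, a minimal sketch of both routes (synthetic data, assuming scikit-learn): drop points outside 1.5×IQR before fitting, or keep everything and switch to an outlier-robust estimator such as HuberRegressor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 1))
y = 3.0 * X.ravel() + rng.normal(scale=0.5, size=200)
y[:5] += 50                                # a few outliers, e.g. from measurement error

# Option 1: filter points outside 1.5 * IQR before fitting.
q1, q3 = np.percentile(y, [25, 75])
mask = (y > q1 - 1.5 * (q3 - q1)) & (y < q3 + 1.5 * (q3 - q1))
LinearRegression().fit(X[mask], y[mask])

# Option 2: keep all points but use an outlier-robust loss.
HuberRegressor().fit(X, y)
```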

4. Use a high-variance model when n<<p

SVM is one of the most popular off-the-shelf modeling algorithms, and one of its most powerful features is the ability to fit the model with different kernels. SVM kernels can be thought of as a way to automatically combine existing features to form a richer feature space. Since this powerful feature comes almost for free, most practitioners use a kernel by default when training an SVM model. However, when the data has n<<p (number of samples << number of features) -- common in domains such as medical data -- the richer feature space implies a much higher risk of overfitting. In fact, high-variance models should be avoided entirely when n<<p.
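As a rough illustration (a sketch with synthetic data, assuming scikit-learn), compare cross-validated accuracy of an RBF-kernel SVM against a plain linear SVM when n<<p; given the argument above, the expectation is that the lower-variance linear model generalizes at least as well.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n, p = 50, 5000                             # 50 samples, 5000 features: n << p
X = rng.normal(size=(n, p))
y = (X[:, 0] + 0.1 * rng.normal(size=n) > 0).astype(int)   # signal lives in one feature

rbf_score = cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean()
lin_score = cross_val_score(SVC(kernel="linear", C=0.1), X, y, cv=5).mean()
print(f"RBF kernel: {rbf_score:.2f}  linear kernel: {lin_score:.2f}")
```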

5. L1/L2/... regularization without standardization

Applying an L1 or L2 penalty to large coefficients is a common way to regularize linear or logistic regression. However, many practitioners are not aware of the importance of standardizing features before applying regularization.

Returning to fraud detection, imagine a linear regression model with a transaction amount feature. Without regularization, if the unit of transaction amount is dollars, the fitted coefficient will be around 100 times larger than it would be if the unit were cents. With regularization, since L1/L2 penalize larger coefficients more, the transaction amount gets penalized more when the unit is dollars. Hence, the regularization is biased and tends to penalize features measured on smaller numeric scales more heavily. To mitigate the problem, standardize all features and put them on an equal footing as a preprocessing step.
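A minimal sketch of the fix, assuming scikit-learn (the data is synthetic): chain StandardScaler and an L1-penalized model in a single pipeline so that the penalty sees every feature on the same scale.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
amount_dollars = rng.uniform(1, 1000, size=(500, 1))   # feature on a large numeric scale
other = rng.normal(size=(500, 1))                      # feature on a small numeric scale
X = np.hstack([amount_dollars, other])
y = 0.01 * amount_dollars.ravel() + 2.0 * other.ravel() + rng.normal(size=500)

# Standardization happens inside the pipeline, before the L1 penalty is applied.
model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
model.fit(X, y)
print(model.named_steps["lasso"].coef_)
```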

6. Use a linear model without considering multi-collinear predictors

Imagine building a linear model with two variables X1 and X2, and suppose the ground truth model is Y=X1+X2. Ideally, if the data is observed with only a small amount of noise, the linear regression solution recovers the ground truth. However, if X1 and X2 are collinear, then as far as most optimization algorithms are concerned, Y=2*X1, Y=3*X1-X2, and Y=100*X1-99*X2 are all equally good. The problem might not be detrimental, since it doesn't bias the estimation, but it does make the problem ill-conditioned and the coefficient weights uninterpretable.
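The effect is easy to reproduce; a small sketch (synthetic data, assuming scikit-learn/NumPy) in which X2 is almost an exact copy of X1: across two noisy samples the individual coefficients swing widely, while their sum stays near the true value of 2.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

for seed in (0, 1):
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=500)
    x2 = x1 + 1e-3 * rng.normal(size=500)       # X2 is nearly identical to X1
    X = np.column_stack([x1, x2])
    y = x1 + x2 + 0.1 * rng.normal(size=500)    # ground truth: Y = X1 + X2
    coef = LinearRegression().fit(X, y).coef_
    print(seed, coef, coef.sum())               # individual coefficients vary; their sum stays near 2
```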

7. Interpreting the absolute value of coefficients from linear or logistic regression as feature importance

Because many off-the-shelf linear regression packages return a p-value for each coefficient, many practitioners believe that, for linear models, the bigger the absolute value of a coefficient, the more important the corresponding feature. This is rarely true because (a) changing the scale of a variable changes the absolute value of its coefficient, and (b) if features are multi-collinear, coefficients can shift from one feature to another. Moreover, the more features a data set has, the more likely the features are to be multi-collinear, and the less reliable it is to interpret feature importance from the coefficients.
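A small sketch of point (a), with synthetic data and assuming scikit-learn: refitting the same model after converting the transaction amount from dollars to cents shrinks the coefficient's absolute value by a factor of about 100, even though the feature's predictive value is unchanged.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
amount_dollars = rng.uniform(1, 100, size=(300, 1))
y = 0.5 * amount_dollars.ravel() + rng.normal(size=300)

coef_dollars = LinearRegression().fit(amount_dollars, y).coef_[0]
coef_cents = LinearRegression().fit(amount_dollars * 100, y).coef_[0]
print(coef_dollars, coef_cents)     # second coefficient is ~100x smaller, same feature
```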

So there you go: 7 common mistakes when doing ML in practice. This list is not meant to be exhaustive but merely to provoke the reader to consider modeling assumptions that may not be applicable to the data at hand. To achieve the best model performance, it is important to pick the modeling algorithm that makes the most fitting assumptions -- not just the one you’re most familiar with.

Original article: http://ml.posthaven.com/machine-learning-done-wrong