How to deal with SettingWithCopyWarning in Pandas?

Translated from: How to deal with SettingWithCopyWarning in Pandas?

Background

I just upgraded my Pandas from 0.11 to 0.13.0rc1. Now the application is emitting many new warnings. One of them looks like this:

E:\FinReporter\FM_EXT.py:449: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  quote_df['TVol']   = quote_df['TVol']/TVOL_SCALE

What exactly does it mean? Do I need to change something?

How should I suppress the warning if I insist on using quote_df['TVol'] = quote_df['TVol']/TVOL_SCALE ?

The function that gives errors

def _decode_stock_quote(list_of_150_stk_str):
    """decode the webpage and return dataframe"""

    from cStringIO import StringIO

    str_of_all = "".join(list_of_150_stk_str)

    quote_df = pd.read_csv(StringIO(str_of_all), sep=',', names=list('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefg')) #dtype={'A': object, 'B': object, 'C': np.float64}
    quote_df.rename(columns={'A':'STK', 'B':'TOpen', 'C':'TPCLOSE', 'D':'TPrice', 'E':'THigh', 'F':'TLow', 'I':'TVol', 'J':'TAmt', 'e':'TDate', 'f':'TTime'}, inplace=True)
    quote_df = quote_df.ix[:,[0,3,2,1,4,5,8,9,30,31]]
    quote_df['TClose'] = quote_df['TPrice']
    quote_df['RT']     = 100 * (quote_df['TPrice']/quote_df['TPCLOSE'] - 1)
    quote_df['TVol']   = quote_df['TVol']/TVOL_SCALE
    quote_df['TAmt']   = quote_df['TAmt']/TAMT_SCALE
    quote_df['STK_ID'] = quote_df['STK'].str.slice(13,19)
    quote_df['STK_Name'] = quote_df['STK'].str.slice(21,30)#.decode('gb2312')
    quote_df['TDate']  = quote_df.TDate.map(lambda x: x[0:4]+x[5:7]+x[8:10])

    return quote_df

More error messages

E:\FinReporter\FM_EXT.py:449: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  quote_df['TVol']   = quote_df['TVol']/TVOL_SCALE
E:\FinReporter\FM_EXT.py:450: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  quote_df['TAmt']   = quote_df['TAmt']/TAMT_SCALE
E:\FinReporter\FM_EXT.py:453: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  quote_df['TDate']  = quote_df.TDate.map(lambda x: x[0:4]+x[5:7]+x[8:10])

#1

Reference: https://stackoom.com/question/1OXeg/如何處理Pandas中的SettingWithCopyWarning


#2

The SettingWithCopyWarning was created to flag potentially confusing "chained" assignments, such as the following, which don't always work as expected, particularly when the first selection returns a copy. [See GH5390 and GH5597 for background discussion.]

df[df['A'] > 2]['B'] = new_val  # new_val not set in df

The warning offers a suggestion to rewrite as follows:

df.loc[df['A'] > 2, 'B'] = new_val
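
To make the difference concrete, here is a minimal sketch on a made-up frame (the data and new_val are assumptions for illustration, not taken from the question). The chained form writes into a temporary object, while the single .loc call writes into df itself.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]})
new_val = 0  # illustrative value, not from the original question

# Chained assignment: df[mask] builds a new object and ['B'] = ... writes into it.
# Warnings are silenced here only to keep the demonstration quiet.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df[df["A"] > 2]["B"] = new_val
chained_result = df["B"].tolist()  # df itself is unchanged

# Single .loc call: rows and column are selected in one step, so df is updated.
df.loc[df["A"] > 2, "B"] = new_val
loc_result = df["B"].tolist()
```

Comparing chained_result with loc_result shows that only the .loc version changed column B for the rows where A > 2.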

However, this doesn't fit your usage, which is equivalent to:

df = df[df['A'] > 2]
df['B'] = new_val

While it's clear that you don't care about writes making it back to the original frame (since you overwrote the reference to it), unfortunately this pattern cannot be differentiated from the first chained assignment example, hence the (false positive) warning. The potential for false positives is addressed in the docs on indexing, if you'd like to read further. You can safely disable this new warning with the following assignment.

pd.options.mode.chained_assignment = None  # default='warn'

#3

In general the point of the SettingWithCopyWarning is to show users (and especially new users) that they may be operating on a copy and not the original as they think. There are false positives (in other words, if you know what you are doing it could be OK). One possibility is simply to turn off the warning (which is set to warn by default), as @Garrett suggests.

Here is another option:

In [1]: df = DataFrame(np.random.randn(5, 2), columns=list('AB'))

In [2]: dfa = df.ix[:, [1, 0]]

In [3]: dfa.is_copy
Out[3]: True

In [4]: dfa['A'] /= 2
/usr/local/bin/ipython:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  #!/usr/local/bin/python

You can set the is_copy flag to False, which will effectively turn off the check for that object:

In [5]: dfa.is_copy = False

In [6]: dfa['A'] /= 2

If you explicitly copy, then no further warning will happen:

In [7]: dfa = df.ix[:, [1, 0]].copy()

In [8]: dfa['A'] /= 2
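
The session above uses .ix, which was removed in pandas 1.0. As a hedged sketch of the same explicit-copy idea on a current pandas, the selection can be written with .iloc instead; the frame contents below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random frame used in the session above.
df = pd.DataFrame(np.arange(10.0).reshape(5, 2), columns=list("AB"))

# Positional selection of columns [1, 0], followed by an explicit copy.
dfa = df.iloc[:, [1, 0]].copy()

# dfa owns its data, so this in-place division does not touch df.
dfa["A"] /= 2
```

After this runs, dfa has columns B and A with A halved, while df still holds the original values.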

The code the OP is showing above, while legitimate, and probably something I do as well, is technically a case for this warning, and not a false positive. Another way to not have the warning would be to do the selection operation via reindex, e.g.

quote_df = quote_df.reindex(columns=['STK', ...])

Or,

quote_df = quote_df.reindex(['STK', ...], axis=1)  # v.0.21
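
A small self-contained sketch of the reindex approach follows. The frame and the list of wanted columns are hypothetical stand-ins (the elided column list above is left as it is); the point is only that reindex returns a new frame that can then be assigned to.

```python
import pandas as pd

# Hypothetical miniature of the quote data; values are made up.
quote_df = pd.DataFrame(
    {"STK": ["x", "y"], "TOpen": [1.0, 2.0], "TPrice": [1.5, 2.5], "Unused": [0, 0]}
)

wanted = ["STK", "TPrice", "TOpen"]  # illustrative subset and order

# Select and reorder columns in one step; the result is a new DataFrame.
quote_df = quote_df.reindex(columns=wanted)

# Adding a column afterwards works on that new frame.
quote_df["TClose"] = quote_df["TPrice"]
```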

#4

Pandas dataframe copy warning

When you go and do something like this:

quote_df = quote_df.ix[:,[0,3,2,1,4,5,8,9,30,31]]

pandas.ix in this case returns a new, standalone dataframe.

Any values you decide to change in this dataframe will not change the original dataframe.

This is what pandas tries to warn you about.


Why .ix is a bad idea

The .ix object tries to do more than one thing, and for anyone who has read anything about clean code, this is a strong smell.

Given this dataframe:

df = pd.DataFrame({"a": [1,2,3,4], "b": [1,1,2,2]})

Two behaviors:

dfcopy = df.ix[:,["a"]]
dfcopy.a.ix[0] = 2

Behavior one: dfcopy is now a standalone dataframe. Changing it will not change df.

df.ix[0, "a"] = 3

Behavior two: This changes the original dataframe.


Use .loc instead

The pandas developers recognized that the .ix object was quite smelly [speculatively] and thus created two new objects which help in the accessing and assignment of data. (The other being .iloc)

.loc is faster, because it does not try to create a copy of the data.

.loc is meant to modify your existing dataframe in place, which is more memory efficient.

.loc is predictable: it has one behavior.
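
As a minimal sketch of how the two accessors divide the work: .loc selects by label and .iloc selects by position. The non-default index below is an assumption added so that labels and positions differ visibly.

```python
import pandas as pd

# Same columns as the frame above, but with an index that differs from positions.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1, 1, 2, 2]}, index=[10, 20, 30, 40])

by_label = df.loc[20, "a"]    # row labelled 20, column labelled "a"
by_position = df.iloc[1, 0]   # second row, first column

# Assignment through a single .loc call writes into df itself.
df.loc[10, "a"] = 3
```

Both by_label and by_position refer to the same cell here, and the final assignment changes the first row of df.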


The solution

What you are doing in your code example is loading a big file with lots of columns, then modifying it to be smaller.

The pd.read_csv function can help you out with a lot of this and also make the loading of the file a lot faster.

So instead of doing this

quote_df = pd.read_csv(StringIO(str_of_all), sep=',', names=list('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefg')) #dtype={'A': object, 'B': object, 'C': np.float64}
quote_df.rename(columns={'A':'STK', 'B':'TOpen', 'C':'TPCLOSE', 'D':'TPrice', 'E':'THigh', 'F':'TLow', 'I':'TVol', 'J':'TAmt', 'e':'TDate', 'f':'TTime'}, inplace=True)
quote_df = quote_df.ix[:,[0,3,2,1,4,5,8,9,30,31]]

Do this

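# usecols ignores the order of its list: columns always come back in file order,
# so the names are listed in file order too. header=None because the data has no header row.
columns = ['STK', 'TOpen', 'TPCLOSE', 'TPrice', 'THigh', 'TLow', 'TVol', 'TAmt', 'TDate', 'TTime']
df = pd.read_csv(StringIO(str_of_all), sep=',', header=None, usecols=[0,1,2,3,4,5,8,9,30,31])
df.columns = columns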

This will only read the columns you are interested in, and name them properly. No need for using the evil .ix object to do magical stuff.
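
One detail worth checking when using usecols: pandas ignores the order of the list and returns the selected columns in file order, so the names must be assigned in file order as well. A small sketch with made-up, header-less data (the column names are borrowed from the question, the values are not):

```python
from io import StringIO

import pandas as pd

raw = "s1,1.0,2.0,3.0\ns2,4.0,5.0,6.0\n"  # made-up data without a header row

# The list order [0, 3, 1] is ignored: columns 0, 1 and 3 come back in file order.
df = pd.read_csv(StringIO(raw), sep=",", header=None, usecols=[0, 3, 1])

# Names therefore follow file order: column 0, then column 1, then column 3.
df.columns = ["STK", "TOpen", "TPrice"]
```

If a different column order is needed afterwards, it can be applied as a separate step, for example with reindex(columns=...).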


#5

If you have assigned the slice to a variable and want to set values using that variable, as in the following:

df2 = df[df['A'] > 2]
df2['B'] = value

And you do not want to use Jeff's solution because the condition computing df2 is too long, or for some other reason, then you can use the following:

df.loc[df2.index.tolist(), 'B'] = value

df2.index.tolist() returns the index labels of all entries in df2, which are then used to set column B in the original dataframe.
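
A runnable sketch of this pattern, with made-up data and an illustrative value. It assumes the index labels of df are unique, so that the labels taken from df2 identify exactly the filtered rows.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [0, 0, 0, 0]})
value = 99  # illustrative value

# The filtered frame is only used to find out which rows matched.
df2 = df[df["A"] > 2]

# Write into the original frame, addressing rows by the labels found in df2.
df.loc[df2.index, "B"] = value
```

Passing df2.index directly works as well as df2.index.tolist(); either way the write goes through a single .loc call on df.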


#6

To remove any doubt, my solution was to make a deep copy of the slice instead of a regular copy. This may not be applicable depending on your context (memory constraints, size of the slice, potential for performance degradation, especially if the copy occurs in a loop like it did for me, etc.).

To be clear, here is the warning I received:

/opt/anaconda3/lib/python3.6/site-packages/ipykernel/__main__.py:54:
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Illustration

I suspected that the warning was thrown because of a column I was dropping on a copy of the slice. While not technically trying to set a value in the copy of the slice, that was still a modification of the copy of the slice. Below are the (simplified) steps I took to confirm the suspicion. I hope they will help those of us who are trying to understand the warning.

Example 1: dropping a column on the original affects the copy

We knew that already, but this is a healthy reminder. This is NOT what the warning is about.

>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1

    A   B
0   111 121
1   112 122
2   113 123


>> df2 = df1
>> df2

    A   B
0   111 121
1   112 122
2   113 123

# Dropping a column on df1 affects df2
>> df1.drop('A', axis=1, inplace=True)
>> df2
    B
0   121
1   122
2   123

It is possible to prevent changes made on df1 from affecting df2:

>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1

    A   B
0   111 121
1   112 122
2   113 123

>> import copy
>> df2 = copy.deepcopy(df1)
>> df2
    A   B
0   111 121
1   112 122
2   113 123

# Dropping a column on df1 does not affect df2
>> df1.drop('A', axis=1, inplace=True)
>> df2
    A   B
0   111 121
1   112 122
2   113 123

Example 2: dropping a column on the copy may affect the original

This actually illustrates the warning.

>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1

    A   B
0   111 121
1   112 122
2   113 123

>> df2 = df1
>> df2

    A   B
0   111 121
1   112 122
2   113 123

# Dropping a column on df2 can affect df1
# No slice involved here, but I believe the principle remains the same?
# Let me know if not
>> df2.drop('A', axis=1, inplace=True)
>> df1

    B
0   121
1   122
2   123

It is possible to prevent changes made on df2 from affecting df1:

>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1

    A   B
0   111 121
1   112 122
2   113 123

>> import copy
>> df2 = copy.deepcopy(df1)
>> df2

    A   B
0   111 121
1   112 122
2   113 123

>> df2.drop('A', axis=1, inplace=True)
>> df1

    A   B
0   111 121
1   112 122
2   113 123
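
For comparison, here is a hedged sketch of the same experiment using the DataFrame's own copy() method rather than copy.deepcopy, which is the more usual way to get an independent frame in pandas. It also makes explicit that plain assignment creates an alias, not a copy.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [111, 112, 113], "B": [121, 122, 123]})

alias = df1        # plain assignment: both names refer to the same object
df2 = df1.copy()   # independent copy (deep=True is the default)

# Dropping in place changes the object that df1 and alias both point to.
df1.drop("A", axis=1, inplace=True)

alias_columns = list(alias.columns)  # follows df1
copy_columns = list(df2.columns)     # unaffected by the drop
```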

Cheers!
