How to check if any value is NaN in a Pandas DataFrame

Translated from: How to check if any value is NaN in a Pandas DataFrame

In Python Pandas, what's the best way to check whether a DataFrame has one (or more) NaN values?

I know about the function pd.isnull, but this returns a DataFrame of booleans, one for each element. This post right here doesn't exactly answer my question either.


Post #1

Reference: https://stackoom.com/question/1zuA4/如何檢查Pandas-DataFrame中的任何值是否爲NaN


Post #2

df.isnull().any().any() should do it.
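
As a minimal sketch of that one-liner in use: the two small frames and the helper name has_nan below are made up for illustration and are not from the original thread.

```python
import numpy as np
import pandas as pd

# Hypothetical example frames (not from the original post)
clean = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
dirty = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 4.0]})


def has_nan(df):
    """Return True if at least one cell of df is missing (NaN, None or NaT)."""
    # isnull() -> boolean DataFrame, first any() -> one bool per column,
    # second any() -> a single bool; bool() turns numpy.bool_ into a Python bool
    return bool(df.isnull().any().any())


print(has_nan(clean))
print(has_nan(dirty))
```

Wrapping the result in bool() is a design choice here so the helper can be used directly in an if statement or compared with `is True`.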


Post #3

You have a couple of options.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10,6))
# Make a few areas have NaN values
df.iloc[1:3,1] = np.nan
df.iloc[5,3] = np.nan
df.iloc[7:9,5] = np.nan

Now the data frame looks something like this:

          0         1         2         3         4         5
0  0.520113  0.884000  1.260966 -0.236597  0.312972 -0.196281
1 -0.837552       NaN  0.143017  0.862355  0.346550  0.842952
2 -0.452595       NaN -0.420790  0.456215  1.203459  0.527425
3  0.317503 -0.917042  1.780938 -1.584102  0.432745  0.389797
4 -0.722852  1.704820 -0.113821 -1.466458  0.083002  0.011722
5 -0.622851 -0.251935 -1.498837       NaN  1.098323  0.273814
6  0.329585  0.075312 -0.690209 -3.807924  0.489317 -0.841368
7 -1.123433 -1.187496  1.868894 -2.046456 -0.949718       NaN
8  1.133880 -0.110447  0.050385 -1.158387  0.188222       NaN
9 -0.513741  1.196259  0.704537  0.982395 -0.585040 -1.693810
  • Option 1: df.isnull().any().any() - This returns a boolean value

You already know that isnull() returns a DataFrame like this:

       0      1      2      3      4      5
0  False  False  False  False  False  False
1  False   True  False  False  False  False
2  False   True  False  False  False  False
3  False  False  False  False  False  False
4  False  False  False  False  False  False
5  False  False  False   True  False  False
6  False  False  False  False  False  False
7  False  False  False  False  False   True
8  False  False  False  False  False   True
9  False  False  False  False  False  False

If you make it df.isnull().any(), you can find just the columns that have NaN values:

0    False
1     True
2    False
3     True
4    False
5     True
dtype: bool

One more .any() will tell you if any of the above are True:

> df.isnull().any().any()
True
  • Option 2: df.isnull().sum().sum() - This returns an integer of the total number of NaN values:

This operates the same way as .any().any() does: it first sums the number of NaN values in each column, then sums those column totals:

df.isnull().sum()
0    0
1    2
2    0
3    1
4    0
5    2
dtype: int64

Finally, to get the total number of NaN values in the DataFrame:

df.isnull().sum().sum()
5
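
The two options can be combined into one short, self-contained script. This is only a sketch: the seeded random generator and the variable names (per_column, cols_with_nan, and so on) are assumptions added for illustration, while the NaN positions match the example frame above.

```python
import numpy as np
import pandas as pd

# Rebuild a frame with NaNs in the same positions as the example above.
# The seeded generator is an assumption so the run is repeatable.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 6)))
df.iloc[1:3, 1] = np.nan
df.iloc[5, 3] = np.nan
df.iloc[7:9, 5] = np.nan

# Option 1: is there any NaN at all?
has_any = bool(df.isnull().any().any())

# Option 2: how many NaNs per column, and in total?
per_column = df.isnull().sum()
total = int(per_column.sum())

# Labels of the columns that contain at least one NaN
cols_with_nan = per_column[per_column > 0].index.tolist()

print(has_any, total, cols_with_nan)
```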

Post #4

jwilner's response is spot on. I was exploring to see if there's a faster option, since in my experience, summing flat arrays is (strangely) faster than counting. This code seems faster:

df.isnull().values.any()

For example:

In [2]: df = pd.DataFrame(np.random.randn(1000,1000))

In [3]: df[df > 0.9] = np.nan

In [4]: %timeit df.isnull().any().any()
100 loops, best of 3: 14.7 ms per loop

In [5]: %timeit df.isnull().values.sum()
100 loops, best of 3: 2.15 ms per loop

In [6]: %timeit df.isnull().sum().sum()
100 loops, best of 3: 18 ms per loop

In [7]: %timeit df.isnull().values.any()
1000 loops, best of 3: 948 µs per loop

df.isnull().sum().sum() is a bit slower, but of course it carries additional information: the number of NaNs.
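
To repeat this comparison outside IPython, one possible sketch uses the standard-library timeit module. The frame size, seed, and repeat count below are arbitrary assumptions, and the timings will vary by machine and pandas version, so no expected numbers are shown. The script also checks that the flat-array forms agree with the column-wise ones.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller frame than in the post so the script finishes quickly (assumed size)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((200, 200)))
df[df > 0.9] = np.nan

# The flat-array variants must give the same answers as the column-wise ones
mask = df.isnull()
total_by_columns = int(mask.sum().sum())
total_flat = int(mask.values.sum())
any_by_columns = bool(mask.any().any())
any_flat = bool(mask.values.any())

candidates = {
    "isnull().any().any()": lambda: df.isnull().any().any(),
    "isnull().values.any()": lambda: df.isnull().values.any(),
    "isnull().sum().sum()": lambda: df.isnull().sum().sum(),
    "isnull().values.sum()": lambda: df.isnull().values.sum(),
}

repeats = 20
for label, fn in candidates.items():
    seconds = timeit.timeit(fn, number=repeats)
    print(f"{label}: {seconds / repeats * 1e3:.3f} ms per call")
```

The .values attribute returns the underlying NumPy array; DataFrame.to_numpy() does the same and can be swapped in here.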


Post #5

Depending on the type of data you're dealing with, you could also just get the value counts of each column while performing your EDA by setting dropna to False.

for col in df:
    print(df[col].value_counts(dropna=False))

Works well for categorical variables, not so much when you have many unique values.
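
A small illustration of this approach follows. The column names and values are invented for the example, and pulling the NaN count back out of the value counts is an extra step that was not in the original reply.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical data with a few missing entries
df = pd.DataFrame({
    "color": ["red", "blue", np.nan, "red", np.nan],
    "size": ["S", "M", "M", np.nan, "L"],
})

nan_counts = {}
for col in df:
    # dropna=False keeps the missing values as their own row in the counts
    counts = df[col].value_counts(dropna=False)
    print(counts)
    # Select the row(s) of the counts whose label is missing
    nan_counts[col] = int(counts[counts.index.isnull()].sum())

print(nan_counts)
```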


Post #6

If you need to know how many rows there are with "one or more NaNs":

df.isnull().T.any().T.sum()

Or if you need to pull out these rows and examine them:

nan_rows = df[df.isnull().T.any().T]
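
An equivalent way to write the row-wise check, offered here as an alternative rather than as the reply's own method, is to pass axis=1 to any() instead of transposing twice. The example frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: rows 0, 1 and 3 each contain at least one NaN
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 6.0, 7.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# axis=1 reduces across the columns, giving one boolean per row
row_has_nan = df.isnull().any(axis=1)

# Number of rows with one or more NaNs
n_rows_with_nan = int(row_has_nan.sum())

# The rows themselves, for inspection
nan_rows = df[row_has_nan]

print(n_rows_with_nan)
print(nan_rows)
```

Avoiding the transpose keeps the expression shorter and skips building a transposed copy of the boolean frame.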