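Trick 100: Loading large datasets

When working with a large dataset, you often do not need to load all of it. About 1% of the rows is enough to build and test your pipeline.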

import numpy as np
import pandas as pd

# Load the full dataset
df = pd.read_csv("../input/us-accidents/US_Accidents_Dec19.csv")
print("The shape of the df is {}".format(df.shape))
del df  # release the full DataFrame before loading the sample
# Load roughly 1% of the rows
df = pd.read_csv("../input/us-accidents/US_Accidents_Dec19.csv", skiprows = lambda x: x>0 and np.random.rand() > 0.01)
print("The shape of the df is {}. It has been reduced about 100 times!".format(df.shape))

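In skiprows, x is the integer row index. The condition x>0 ensures the header row is never skipped, and np.random.rand() > 0.01 is true about 99% of the time, so roughly 99% of the data rows are skipped.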
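The sampling behaviour can be checked without the Kaggle file. The sketch below is only an illustration: it assumes a small synthetic CSV built in memory (the id and value columns are made up) and uses a seeded NumPy generator so the run is repeatable.

```python
import io

import numpy as np
import pandas as pd

# Synthetic CSV with a header and 10,000 data rows (illustrative data only).
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(1, 10001))

# Seeded generator so the sample is reproducible.
rng = np.random.default_rng(0)

# Row 0 is the header and is always kept; every other row is skipped
# with probability 0.99.
sample = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=lambda x: x > 0 and rng.random() > 0.01,
)

print(list(sample.columns))
print(len(sample))
```

The number of rows kept is random, so expect something near 100 rather than exactly 100.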

Trick 99: Unnamed: 0

Avoid the Unnamed: 0 column when reading data.

d = {
"zip_code": [12345, 56789, 101112, 131415],
"factory": [100, 400, 500, 600],
"warehouse": [200, 300, 400, 500],
"retail": [1, 2, 3, 4]
}
df = pd.DataFrame(d)
df
# save to csv (the index is written as an unnamed first column by default)
df.to_csv("trick99data.csv")

First we save a DataFrame to CSV.

df = pd.read_csv("trick99data.csv")
df  # contains an extra "Unnamed: 0" column
df = pd.read_csv("trick99data.csv", index_col=0)
# or avoid writing the index when saving:
# df.to_csv("trick99data.csv", index = False)
df

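As we can see, reading the file without index_col=0 (or saving it without index=False) produces an extra column named Unnamed: 0. We could remove it afterwards with drop, but avoiding it up front looks more professional, doesn't it?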
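The three cases can be compared side by side without touching the disk. This is a minimal sketch that assumes a two-row DataFrame and uses in-memory CSV text instead of trick99data.csv.

```python
import io

import pandas as pd

df = pd.DataFrame({"zip_code": [12345, 56789], "retail": [1, 2]})

# to_csv() returns the CSV text when no path is given.
csv_with_index = df.to_csv()
csv_without_index = df.to_csv(index=False)

# Default read of a file that contains the index: extra column appears.
cols_default = list(pd.read_csv(io.StringIO(csv_with_index)).columns)

# Tell pandas that the first column is the index.
cols_index_col = list(pd.read_csv(io.StringIO(csv_with_index), index_col=0).columns)

# The index was never written, so there is nothing to clean up.
cols_no_index = list(pd.read_csv(io.StringIO(csv_without_index)).columns)

print(cols_default)
print(cols_index_col)
print(cols_no_index)
```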

Trick 98: Converting a wide DataFrame (many columns) into a long one (many rows)

d = {
"zip_code": [12345, 56789, 101112, 131415],
"factory": [100, 400, 500, 600],
"warehouse": [200, 300, 400, 500],
"retail": [1, 2, 3, 4]
}
df = pd.DataFrame(d)
df
df = df.melt(id_vars = "zip_code", var_name = "location_type", value_name = "distance")
df

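This relies on the pd.melt function, called here as the DataFrame.melt method.

id_vars are the columns to keep unchanged as identifiers.
value_vars are the columns to unpivot; if omitted, every column not listed in id_vars is unpivoted.
var_name and value_name are the names of the two resulting columns.
Each (column name, cell value) pair becomes one row with the columns (var_name, value_name).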
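The example above leaves value_vars out. The sketch below, which assumes a shortened two-row version of the same data, shows what happens when only some columns are listed in value_vars.

```python
import pandas as pd

df = pd.DataFrame({
    "zip_code": [12345, 56789],
    "factory": [100, 400],
    "warehouse": [200, 300],
    "retail": [1, 2],
})

# Only factory and warehouse are unpivoted; retail is dropped from the result.
long_df = df.melt(
    id_vars="zip_code",
    value_vars=["factory", "warehouse"],
    var_name="location_type",
    value_name="distance",
)

print(long_df)
```

With 2 rows and 2 unpivoted columns the result has 4 rows, one per (zip_code, location_type) combination.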

Trick 97: Converting a year and a day of the year into a date

d = {
"year": [2019, 2019, 2020],
"day_of_year": [350, 365, 1]
}
df = pd.DataFrame(d)
df
df["combined"] = df["year"]*1000 + df["day_of_year"]
df
df["date"] = pd.to_datetime(df["combined"], format = "%Y%j")
df

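The format used here is %Y%j: a four-digit year followed by the three-digit day of the year.
Another common format is %Y%m%d %H:%M:%S (note the lowercase %d for the day of the month).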
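Both formats can be tried on a few literal strings. This is a small sketch with made-up sample values; the strings are passed as text so the format is matched character by character.

```python
import pandas as pd

# Year followed by the zero-padded day of the year.
day_of_year = pd.Series(["2019350", "2019365", "2020001"])
dates = pd.to_datetime(day_of_year, format="%Y%j")

# Year, month, day, then a time of day.
stamps = pd.Series(["20200115 13:45:30"])
timestamps = pd.to_datetime(stamps, format="%Y%m%d %H:%M:%S")

print(dates.tolist())
print(timestamps.tolist())
```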

Trick 96: Interactive plots with pandas

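print(pd.__version__)
# Pandas version 0.25 or higher is required, and you need hvplot
# df is assumed to already hold a dataset with spirit_servings, wine_servings and continent columns (loading is not shown here)
# This plot is not interactive
df.plot(kind = "scatter", x = "spirit_servings", y = "wine_servings")

!pip install hvplot
pd.options.plotting.backend = "hvplot"
# This plot is interactive
df.plot(kind = "scatter", x = "spirit_servings", y = "wine_servings", c = "continent")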

Trick 95: Counting missing values

d = {
"col1": [2019, 2019, 2020],
"col2": [350, 365, 1],
"col3": [np.nan, 365, None]
}
df = pd.DataFrame(d)
df
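# Solution 1: number of missing values per column
df.isnull().sum()
# Solution 2: the same count using isna
df.isna().sum()
# Solution 3: whether each column has any missing value (True/False, not a count)
df.isna().any()
# Solution 4: whether the DataFrame has any missing value at all
df.isna().any(axis = None)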

df.isnull() is an alias of df.isna(), so the two behave identically.
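A quick check on the same three-column data confirms that the two methods return the same mask, and shows how to get a single total on top of the per-column counts. This is a minimal sketch; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [2019, 2019, 2020],
    "col2": [350, 365, 1],
    "col3": [np.nan, 365, None],
})

# isnull and isna produce the same boolean mask.
same_mask = df.isnull().equals(df.isna())

# Missing values per column, then summed into one total.
per_column = df.isna().sum()
total_missing = int(per_column.sum())

print(same_mask)
print(per_column)
print(total_missing)
```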

Trick 94: Choosing smaller dtypes to save memory

df = pd.read_csv("../input/titanic/train.csv", usecols = ["Pclass", "Sex", "Parch", "Cabin"])
df
# let's see how much our df occupies in memory
df.memory_usage(deep = True)
df.dtypes
# convert to smaller datatypes
df = df.astype({"Pclass":"int8",
                "Sex":"category", 
                "Parch": "Sparse[int]", # most values are 0
                "Cabin":"Sparse[str]"}) # most values are NaN
df.memory_usage(deep = True)
df.dtypes
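The effect is easy to measure without the Titanic file. The sketch below assumes a synthetic frame with 3,000 rows that mimics the Pclass and Sex columns, and covers only the int8 and category conversions.

```python
import pandas as pd

# Synthetic stand-in for two Titanic columns (illustrative values only).
df = pd.DataFrame({
    "Pclass": [1, 2, 3] * 1000,
    "Sex": ["male", "female", "male"] * 1000,
})

before = df.memory_usage(deep=True)

# Pclass only holds 1, 2 and 3, so it fits in int8;
# Sex has two distinct values, so category stores each string once.
small = df.astype({"Pclass": "int8", "Sex": "category"})
after = small.memory_usage(deep=True)

print(before)
print(after)
```

The values are unchanged by the conversion; only their storage differs.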

Trick 93: Grouping low-frequency categories into "Other"

d = {"genre": ["A", "A", "A", "A", "A", "B", "B", "C", "D", "E", "F"]}
df = pd.DataFrame(d)
df
# Step 1: count the frequencies
frequencies = df["genre"].value_counts(normalize = True)
frequencies
# Step 2: establish your threshold and filter the smaller categories
threshold = 0.1
small_categories = frequencies[frequencies < threshold].index
small_categories
# Step 3: replace the values
df["genre"] = df["genre"].replace(small_categories, "Other")
df["genre"].value_counts(normalize = True)
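The replacement step can also be written with isin and where instead of replace. This is an alternative sketch, not the approach used above, and it reuses the same genre data and 0.1 threshold.

```python
import pandas as pd

genre = pd.Series(["A"] * 5 + ["B"] * 2 + ["C", "D", "E", "F"], name="genre")

# Relative frequency of each category.
frequencies = genre.value_counts(normalize=True)

# Categories that make up less than 10% of the rows.
small_categories = frequencies[frequencies < 0.1].index

# Keep values that are not rare; substitute "Other" everywhere else.
grouped = genre.where(~genre.isin(small_categories), "Other")

print(grouped.value_counts())
```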

Trick 92: Cleaning a mixed-format column with regular expressions

d = {"customer": ["A", "B", "C", "D"], "sales":[1100, 950.75, "$400", "$1250.35"]}
df = pd.DataFrame(d)
df
# Step 1: check the data types
df["sales"].apply(type)
# Step 2: use regex
df["sales"] = df["sales"].replace("[$,]", "", regex = True).astype("float")
df
df["sales"].apply(type)

In a regular expression, [abc] matches any single character that is a, b, or c, so [$,] matches either a dollar sign or a comma.
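The character class also handles thousands separators. The sketch below assumes an all-string column with made-up values and uses Series.str.replace, which applies the pattern to string values.

```python
import pandas as pd

sales = pd.Series(["$1,250.35", "$400", "950.75"])

# Inside square brackets, $ and , are plain characters to match.
cleaned = sales.str.replace(r"[$,]", "", regex=True).astype(float)

print(cleaned.tolist())
```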

Trick 91: Changing the column order of a DataFrame

d = {"A":[15, 20], "B":[20, 25], "C":[30 ,40], "D":[50, 60]}
df = pd.DataFrame(d)
df
# Using insert
df.insert(3, "C2", df["C"]*2)
df
# Second approach: append the column first, then reorder
df["C3"] = df["C"]*3 # create a new column; it is appended at the end
columns = df.columns.to_list() # create a list with all columns
location = 4 # specify the location where you want your new column
columns = columns[:location] + ["C3"] + columns[location:-1] # rearrange the list
df = df[columns] # create the DataFrame with the column order you like
df
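The list-slicing approach can be wrapped in a small helper so it works for any existing column. move_column is a hypothetical name introduced here for illustration; it is not a pandas function.

```python
import pandas as pd


def move_column(df, name, position):
    """Return a copy of df with column `name` placed at index `position`."""
    # Every column except the one being moved, in the current order.
    columns = [c for c in df.columns if c != name]
    # Put the moved column back at the requested position.
    columns.insert(position, name)
    return df[columns]


df = pd.DataFrame({"A": [15, 20], "B": [20, 25], "C": [30, 40], "D": [50, 60]})
reordered = move_column(df, "D", 1)

print(list(reordered.columns))
```

The original DataFrame keeps its column order because the helper returns a new selection rather than modifying df in place.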

忽逢桃林

100 tricks for the Python pandas library: a must-read for beginners, learn them and you will be a pandas expert
