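A neat trick: working with Pandas the SQL way

Pandas is a very convenient library for data processing and data analysis. In the article "Everyone Is a Data Analyst, Everyone Can Master Pandas" I gave Pandas a systematic walkthrough.

Still, there is no denying that not every programmer knows Python, and not every Pythoner knows how to use Pandas.

The good news is that with pandasql you can use SQL to work with a DataFrame.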

# Import the required libraries
import numpy as np
import pandas as pd

from pandasql import sqldf, load_meat, load_births

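Basics

The main function in pandasql is sqldf. It takes two arguments:

A SQL query string
A set of session/environment variables (locals() or globals())

To save some typing, we can define a small helper that always passes globals() for us.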

pysqldf = lambda sql: sqldf(sql, globals())
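
To make the role of the second argument more concrete, here is a rough sketch of the idea: look up the table names used in the query inside an environment dict, copy those tables into an in-memory SQLite database, and run the query there. This is not pandasql's implementation. mini_sqldf is a hypothetical helper, tables are plain lists of dicts rather than DataFrames, and an explicit dict stands in for globals() or locals(), which are dicts of the same shape.

```python
import re
import sqlite3


def mini_sqldf(sql, env):
    """Hypothetical stand-in for sqldf; tables are lists of dicts, not DataFrames."""
    conn = sqlite3.connect(":memory:")
    try:
        # Treat every identifier in the query as a candidate table name
        # and look it up in the supplied environment.
        for name in set(re.findall(r"[A-Za-z_]\w*", sql)):
            rows = env.get(name)
            if isinstance(rows, list) and rows and isinstance(rows[0], dict):
                cols = list(rows[0])
                col_sql = ", ".join(f'"{c}"' for c in cols)
                marks = ", ".join("?" for _ in cols)
                conn.execute(f'CREATE TABLE "{name}" ({col_sql})')
                conn.executemany(
                    f'INSERT INTO "{name}" VALUES ({marks})',
                    [tuple(r[c] for c in cols) for r in rows],
                )
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


# Two rows taken from the births sample; the name births_demo is made up.
births_demo = [
    {"date": "1975-01-01", "births": 265775},
    {"date": "1975-02-01", "births": 241045},
]
env = {"births_demo": births_demo}  # plays the role of globals()

result = mini_sqldf("select births from births_demo limit 1", env)
print(result)
```

If the environment passed in does not contain the table name, the query fails because SQLite has no such table. That is why the helper above fixes the environment to globals().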

Next, let's load some data.

meat = load_meat()
meat.head()

	date	beef	veal	pork	lamb_and_mutton	broilers	other_chicken	turkey
0	1944-01-01	751.0	85.0	1280.0	89.0	NaN	NaN	NaN
1	1944-02-01	713.0	77.0	1169.0	72.0	NaN	NaN	NaN
2	1944-03-01	741.0	90.0	1128.0	75.0	NaN	NaN	NaN
3	1944-04-01	650.0	89.0	978.0	66.0	NaN	NaN	NaN
4	1944-05-01	681.0	106.0	1029.0	78.0	NaN	NaN	NaN

births = load_births()
births.head()

	date	births
0	1975-01-01	265775
1	1975-02-01	241045
2	1975-03-01	268849
3	1975-04-01	247455
4	1975-05-01	254545

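Queries

pandasql uses SQLite syntax. Any DataFrame is detected by pandasql automatically, so you can query it as if it were a table.

Limiting rows

First, let's see how to limit the number of rows returned. Here we fetch the first two rows.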

sql = "select * from births limit 2"
pysqldf(sql)

	date	births
0	1975-01-01 00:00:00.000000	265775
1	1975-02-01 00:00:00.000000	241045

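Besides taking the first N rows from the top, we can also set an offset. Here we skip the first two rows and fetch the next two, that is, the third and fourth rows.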

sql = "select * from births limit 2 offset 2"
pysqldf(sql)

	date	births
0	1975-03-01 00:00:00.000000	268849
1	1975-04-01 00:00:00.000000	247455
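
LIMIT and OFFSET are commonly combined for paging. The sketch below uses the standard-library sqlite3 module directly (not pandasql) with the five births rows shown earlier. The fetch_page helper is a hypothetical name, and an ORDER BY is added as an assumption so that the paging order is well defined.

```python
import sqlite3

# In-memory SQLite table holding the first five births rows from above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE births (date TEXT, births INTEGER)")
conn.executemany(
    "INSERT INTO births VALUES (?, ?)",
    [
        ("1975-01-01", 265775),
        ("1975-02-01", 241045),
        ("1975-03-01", 268849),
        ("1975-04-01", 247455),
        ("1975-05-01", 254545),
    ],
)


def fetch_page(page, size):
    """Return page number `page` (1-based) with `size` rows per page."""
    offset = (page - 1) * size  # rows to skip before this page starts
    return conn.execute(
        "SELECT date, births FROM births ORDER BY date LIMIT ? OFFSET ?",
        (size, offset),
    ).fetchall()


first_page = fetch_page(1, 2)
second_page = fetch_page(2, 2)  # same as "limit 2 offset 2"
print(second_page)
```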

Selecting columns

Since this is SQL, we can of course restrict the query to just the columns we need. Here we fetch only the births column.

sql = "select births from births limit 2"
pysqldf(sql)

	births
0	265775
1	241045

Sorting

Sorting is another very common operation, and pandasql supports it well. Here we sort by date descending, then by births ascending.

sql = "select * from births order by date desc, births asc limit 2"
pysqldf(sql)

	date	births
0	2012-12-01 00:00:00.000000	340995
1	2012-11-01 00:00:00.000000	320195
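
Because each date appears only once in this result, the second sort key has nothing to reorder. A small sketch with sqlite3 and made-up rows (the birth counts 1, 2, 3 are placeholders, not real data) shows how the second key breaks ties when dates repeat:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (date TEXT, births INTEGER)")
# Made-up rows: two share the same date so the tie-break is visible.
conn.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    [
        ("2012-12-01", 3),
        ("2012-11-01", 2),
        ("2012-12-01", 1),
    ],
)

# Newest date first; within the same date, smallest births first.
ordered = conn.execute(
    "SELECT date, births FROM demo ORDER BY date DESC, births ASC"
).fetchall()
print(ordered)
conn.close()
```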

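Filtering with conditions

We can use where to select only the rows that meet our requirements. Here we keep the rows where turkey is not null and date is later than 1974-12-31.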

sql = """
select *
from meat
where turkey not null
and date > '1974-12-31'
limit 5
"""
pysqldf(sql)

	date	beef	veal	pork	lamb_and_mutton	broilers	other_chicken	turkey
0	1975-01-01 00:00:00.000000	2106.0	59.0	1114.0	36.0	646.2	None	64.9
1	1975-02-01 00:00:00.000000	1845.0	50.0	954.0	31.0	570.2	None	47.1
2	1975-03-01 00:00:00.000000	1891.0	57.0	976.0	35.0	616.6	None	54.4
3	1975-04-01 00:00:00.000000	1895.0	60.0	1100.0	34.0	688.3	None	68.7
4	1975-05-01 00:00:00.000000	1849.0	59.0	934.0	31.0	690.1	None	81.9
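
Two details of this query are worth checking. SQLite accepts the postfix form "not null" as an equivalent of the more portable "is not null", and the date comparison is a plain string comparison, which works because the dates are stored as ISO-formatted text. A minimal sqlite3 sketch using three rows taken from the meat output above (date and turkey columns only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meat (date TEXT, turkey REAL)")
conn.executemany(
    "INSERT INTO meat VALUES (?, ?)",
    [
        ("1944-01-01 00:00:00.000000", None),  # turkey is missing in 1944
        ("1975-01-01 00:00:00.000000", 64.9),
        ("1975-02-01 00:00:00.000000", 47.1),
    ],
)

# SQLite-specific postfix form, as used in the article's query.
sqlite_form = conn.execute(
    "SELECT date, turkey FROM meat "
    "WHERE turkey NOT NULL AND date > '1974-12-31' ORDER BY date"
).fetchall()

# Standard SQL form; expected to return the same rows.
standard_form = conn.execute(
    "SELECT date, turkey FROM meat "
    "WHERE turkey IS NOT NULL AND date > '1974-12-31' ORDER BY date"
).fetchall()

print(sqlite_form == standard_form, sqlite_form)
conn.close()
```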

Aggregation

Aggregation is indispensable in data analysis, and pandasql supports it too. Here we group by year and compute the sum, mean, maximum, and minimum of births.

sql = """
select 
strftime('%Y', date) as year,
sum(births),
avg(births),
max(births),
min(births)
from births
group by
strftime('%Y', date)
limit 3
"""
pysqldf(sql)

	year	sum(births)	avg(births)	max(births)	min(births)
0	1975	3136965	261413.750000	281300	241045
1	1976	6304156	262673.166667	286496	236551
2	1979	3333279	277773.250000	302805	249898
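
The column headers above, such as sum(births), are simply the raw expressions. As a sketch of one possible refinement, the sqlite3 example below adds aliases and an explicit ORDER BY. The alias names (total_births and so on) are assumptions for illustration, and the table holds only four births rows that appear earlier in this article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE births (date TEXT, births INTEGER)")
conn.executemany(
    "INSERT INTO births VALUES (?, ?)",
    [
        ("1975-01-01 00:00:00.000000", 265775),
        ("1975-02-01 00:00:00.000000", 241045),
        ("2012-11-01 00:00:00.000000", 320195),
        ("2012-12-01 00:00:00.000000", 340995),
    ],
)

# strftime('%Y', ...) extracts the year as text from the stored timestamp.
yearly = conn.execute(
    """
    SELECT strftime('%Y', date) AS year,
           sum(births) AS total_births,
           max(births) AS max_births,
           min(births) AS min_births
    FROM births
    GROUP BY strftime('%Y', date)
    ORDER BY year
    """
).fetchall()
print(yearly)
conn.close()
```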

Joins

Joins are also a very common operation. Here we join the two DataFrames meat and births on the date column.

sql = """
select
m.date, b.births, m.beef
from meat m
inner join births b
on m.date = b.date
order by
m.date
limit 5;
"""
pysqldf(sql)

	date	births	beef
0	1975-01-01 00:00:00.000000	265775	2106.0
1	1975-02-01 00:00:00.000000	241045	1845.0
2	1975-03-01 00:00:00.000000	268849	1891.0
3	1975-04-01 00:00:00.000000	247455	1895.0
4	1975-05-01 00:00:00.000000	254545	1849.0
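
An inner join keeps only the dates present in both tables, which is why the result starts in 1975 even though meat goes back to 1944. The sqlite3 sketch below reproduces this with a few rows from the outputs above (only the date and beef columns of meat):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meat (date TEXT, beef REAL)")
conn.execute("CREATE TABLE births (date TEXT, births INTEGER)")
conn.executemany(
    "INSERT INTO meat VALUES (?, ?)",
    [
        ("1944-01-01 00:00:00.000000", 751.0),  # no matching births row
        ("1975-01-01 00:00:00.000000", 2106.0),
        ("1975-02-01 00:00:00.000000", 1845.0),
    ],
)
conn.executemany(
    "INSERT INTO births VALUES (?, ?)",
    [
        ("1975-01-01 00:00:00.000000", 265775),
        ("1975-02-01 00:00:00.000000", 241045),
    ],
)

joined = conn.execute(
    """
    SELECT m.date, b.births, m.beef
    FROM meat m
    INNER JOIN births b ON m.date = b.date
    ORDER BY m.date
    """
).fetchall()
print(joined)  # the 1944 meat row is dropped by the inner join
conn.close()
```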

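These are some of the most commonly used features. Beyond them, pandasql supports many more operations, all based on SQLite syntax. If you are interested, they are worth exploring on your own.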
