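Acknowledgement: thanks to laserhe, denniswwh, ACMAIN_CHM and vinsonshen for their generous help.
First, what this SQL does: it returns the rows of set a that are not in set b.
The NOT IN version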
select add_tb.RUID
from (select distinct RUID
from UserMsg
where SubjectID =12
and CreateTime>'2009-8-14 15:30:00'
and CreateTime<='2009-8-17 16:00:00'
) add_tb
where add_tb.RUID
not in (select distinct RUID
from UserMsg
where SubjectID =12
and CreateTime<'2009-8-14 15:30:00'
)
Returns 444 rows in 0.07 sec
EXPLAIN output
+----+--------------------+------------+----------------+---------------------------+------------+---------+------+------+------------------------------+
| id | select_type        | table      | type           | possible_keys             | key        | key_len | ref  | rows | Extra                        |
+----+--------------------+------------+----------------+---------------------------+------------+---------+------+------+------------------------------+
|  1 | PRIMARY            | <derived2> | ALL            | NULL                      | NULL       | NULL    | NULL |  452 | Using where                  |
|  3 | DEPENDENT SUBQUERY | UserMsg    | index_subquery | RUID,SubjectID,CreateTime | RUID       | 96      | func |    2 | Using index; Using where     |
|  2 | DERIVED            | UserMsg    | range          | SubjectID,CreateTime      | CreateTime | 9       | NULL | 1857 | Using where; Using temporary |
+----+--------------------+------------+----------------+---------------------------+------------+---------+------+------+------------------------------+
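Analysis: this query is fast because the id=2 query returns few rows, so the id=1 step has little to check. The id=2 step uses a temporary table, and I was not sure whether an index is still used at that point. (Its EXPLAIN row shows type=range with key=CreateTime, so the rows are read through the CreateTime index; the temporary table is there for the DISTINCT.)
One LEFT JOIN version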
select a.ruid,b.ruid
from(select distinct RUID
from UserMsg
where SubjectID =12
and CreateTime >= '2009-8-14 15:30:00'
and CreateTime<='2009-8-17 16:00:00'
) a left join (
select distinct RUID
from UserMsg
where SubjectID =12 and CreateTime< '2009-8-14 15:30:00'
) b on a.ruid = b.ruid
where b.ruid is null
Returns 444 rows in 0.39 sec
EXPLAIN output
+----+-------------+------------+-------+----------------------+------------+---------+------+------+------------------------------+
| id | select_type | table      | type  | possible_keys        | key        | key_len | ref  | rows | Extra                        |
+----+-------------+------------+-------+----------------------+------------+---------+------+------+------------------------------+
|  1 | PRIMARY     | <derived2> | ALL   | NULL                 | NULL       | NULL    | NULL |  452 |                              |
|  1 | PRIMARY     | <derived3> | ALL   | NULL                 | NULL       | NULL    | NULL | 1112 | Using where; Not exists      |
|  3 | DERIVED     | UserMsg    | ref   | SubjectID,CreateTime | SubjectID  | 5       |      | 6667 | Using where; Using temporary |
|  2 | DERIVED     | UserMsg    | range | SubjectID,CreateTime | CreateTime | 9       | NULL | 1838 | Using where; Using temporary |
+----+-------------+------------+-------+----------------------+------------+---------+------+------+------------------------------+
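Analysis: two temporary tables are used, and the two are joined as a Cartesian product (both derived tables are read with type=ALL), so the join cannot use an index and the amount of data compared is large.
Another LEFT JOIN version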
select distinct a.RUID
from UserMsg a
left join UserMsg b
on a.ruid = b.ruid
and b.subjectID =12 and b.createTime < '2009-8-14 15:30:00'
where a.subjectID =12
and a.createTime >= '2009-8-14 15:30:00'
and a.createtime <='2009-8-17 16:00:00'
and b.ruid is null;
Returns 444 rows in 0.07 sec
EXPLAIN output
+----+-------------+-------+-------+---------------------------+------------+---------+--------------+------+-----------------------------------+
| id | select_type | table | type  | possible_keys             | key        | key_len | ref          | rows | Extra                             |
+----+-------------+-------+-------+---------------------------+------------+---------+--------------+------+-----------------------------------+
|  1 | SIMPLE      | a     | range | SubjectID,CreateTime      | CreateTime | 9       | NULL         | 1839 | Using where; Using temporary      |
|  1 | SIMPLE      | b     | ref   | RUID,SubjectID,CreateTime | RUID       | 96      | dream.a.RUID |    2 | Using where; Not exists; Distinct |
+----+-------------+-------+-------+---------------------------+------------+---------+--------------+------+-----------------------------------+
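Analysis: both table accesses use an index, and they are carried out together within a single join, so this query should be very efficient.
The NOT EXISTS version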
select distinct a.ruid
from UserMsg a
where a.subjectID =12
and a.createTime >= '2009-8-14 15:30:00'
and a.createTime <='2009-8-17 16:00:00'
and not exists (
select distinct RUID
from UserMsg
where subjectID =12 and createTime < '2009-8-14 15:30:00'
and ruid=a.ruid
)
Returns 444 rows in 0.08 sec
EXPLAIN output
+----+--------------------+---------+-------+---------------------------+------------+---------+--------------+------+------------------------------+
| id | select_type        | table   | type  | possible_keys             | key        | key_len | ref          | rows | Extra                        |
+----+--------------------+---------+-------+---------------------------+------------+---------+--------------+------+------------------------------+
|  1 | PRIMARY            | a       | range | SubjectID,CreateTime      | CreateTime | 9       | NULL         | 1839 | Using where; Using temporary |
|  2 | DEPENDENT SUBQUERY | UserMsg | ref   | RUID,SubjectID,CreateTime | RUID       | 96      | dream.a.RUID |    2 | Using where                  |
+----+--------------------+---------+-------+---------------------------+------------+---------+--------------+------+------------------------------+
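Analysis: essentially the same as the previous plan, except that the work is split into two queries executed one after the other (the inner one runs as a dependent subquery), so it is slightly slower than the third query.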
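All of the queries above express the same "in a but not in b" operation. As a minimal sketch of that equivalence, the snippet below runs the NOT IN, LEFT JOIN ... IS NULL and NOT EXISTS forms on a tiny made-up table. It assumes SQLite (through Python's sqlite3 module) rather than MySQL, so it only checks that the results agree and says nothing about MySQL's plans or timings. The sample rows are invented, and the dates are zero-padded because SQLite compares them as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE UserMsg (RUID TEXT, SubjectID INTEGER, CreateTime TEXT)")
conn.executemany(
    "INSERT INTO UserMsg VALUES (?, ?, ?)",
    [
        ("u1", 12, "2009-08-10 09:00:00"),  # u1 was active before the window
        ("u1", 12, "2009-08-15 10:00:00"),  # ... and again inside it
        ("u2", 12, "2009-08-15 11:00:00"),  # u2 appears only inside the window
        ("u3", 12, "2009-08-16 12:00:00"),  # u3 appears only inside the window
        ("u3", 12, "2009-08-16 13:00:00"),  # duplicate, removed by DISTINCT
        ("u4", 12, "2009-08-01 08:00:00"),  # u4 appears only before the window
    ],
)

queries = {
    "not in": """
        SELECT DISTINCT RUID FROM UserMsg
        WHERE SubjectID = 12 AND CreateTime >= :start AND CreateTime <= :end
          AND RUID NOT IN (SELECT RUID FROM UserMsg
                           WHERE SubjectID = 12 AND CreateTime < :start)""",
    "left join": """
        SELECT DISTINCT a.RUID FROM UserMsg a
        LEFT JOIN UserMsg b
          ON a.RUID = b.RUID AND b.SubjectID = 12 AND b.CreateTime < :start
        WHERE a.SubjectID = 12 AND a.CreateTime >= :start AND a.CreateTime <= :end
          AND b.RUID IS NULL""",
    "not exists": """
        SELECT DISTINCT a.RUID FROM UserMsg a
        WHERE a.SubjectID = 12 AND a.CreateTime >= :start AND a.CreateTime <= :end
          AND NOT EXISTS (SELECT 1 FROM UserMsg b
                          WHERE b.SubjectID = 12 AND b.CreateTime < :start
                            AND b.RUID = a.RUID)""",
}
params = {"start": "2009-08-14 15:30:00", "end": "2009-08-17 16:00:00"}

# Run each form and collect the RUIDs it returns, sorted for comparison.
results = {
    name: sorted(row[0] for row in conn.execute(sql, params))
    for name, sql in queries.items()
}
for name, ruids in results.items():
    print(name, ruids)
```

One caveat that holds in MySQL as well: the forms only agree when RUID is never NULL in the subquery, because NOT IN returns no rows once the subquery produces a NULL.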
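To verify the relative efficiency, I removed the subjectID = 12 condition from the four queries above. The query times, in the same order as the queries, were:
0.20s (NOT IN)
21.31s (LEFT JOIN between two derived tables)
0.25s (LEFT JOIN on the base table)
0.43s (NOT EXISTS)
Summary of laserhe's analysis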
select a.ruid,b.ruid
from( select distinct RUID
from UserMsg
where CreateTime >= '2009-8-14 15:30:00'
and CreateTime<='2009-8-17 16:00:00'
) a left join UserMsg b
on a.ruid = b.ruid
and b.createTime < '2009-8-14 15:30:00'
where b.ruid is null;
Execution time: 0.13s
+----+-------------+------------+-------+-----------------+------------+---------+--------+------+------------------------------+
| id | select_type | table      | type  | possible_keys   | key        | key_len | ref    | rows | Extra                        |
+----+-------------+------------+-------+-----------------+------------+---------+--------+------+------------------------------+
|  1 | PRIMARY     | <derived2> | ALL   | NULL            | NULL       | NULL    | NULL   | 1248 |                              |
|  1 | PRIMARY     | b          | ref   | RUID,CreateTime | RUID       | 96      | a.RUID |    2 | Using where; Not exists      |
|  2 | DERIVED     | UserMsg    | range | CreateTime      | CreateTime | 9       | NULL   | 3553 | Using where; Using temporary |
+----+-------------+------------+-------+-----------------+------------+---------+--------+------+------------------------------+
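Its efficiency is similar to that of the NOT IN version.
A basic principle of database optimization: make the Cartesian product happen between sets that are as small as possible. When MySQL joins against a base table it can scan directly through an index, but once the table is wrapped in a subquery (a derived table) the query planner no longer knows how to use a suitable index.
This is how a database optimizes a SQL statement: the SQL is first parsed into a parse tree, a tree-shaped data structure. Working on that structure, the query planner looks for suitable indexes, enumerates the possible combinations according to the situation, computes the cost of each combination (something like a machine-readable version of the EXPLAIN output), then picks the cheapest one and executes it. So compare: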
explain select a.ruid,b.ruid
from (select distinct RUID
from UserMsg
where CreateTime >= '2009-8-14 15:30:00'
and CreateTime<='2009-8-17 16:00:00'
) a left join UserMsg b
on a.ruid = b.ruid
and b.createTime < '2009-8-14 15:30:00'
where b.ruid is null;
and
explain select add_tb.RUID
from (select distinct RUID
from UserMsg
where CreateTime>'2009-8-14 15:30:00'
and CreateTime<='2009-8-17 16:00:00'
) add_tb
where add_tb.RUID
not in (select distinct RUID
from UserMsg
where CreateTime<'2009-8-14 15:30:00'
);
EXPLAIN output of the NOT IN query (the plan of the first query is the one shown above)
+----+--------------------+------------+----------------+-----------------+------------+---------+------+------+------------------------------+
| id | select_type        | table      | type           | possible_keys   | key        | key_len | ref  | rows | Extra                        |
+----+--------------------+------------+----------------+-----------------+------------+---------+------+------+------------------------------+
|  1 | PRIMARY            | <derived2> | ALL            | NULL            | NULL       | NULL    | NULL | 1248 | Using where                  |
|  3 | DEPENDENT SUBQUERY | UserMsg    | index_subquery | RUID,CreateTime | RUID       | 96      | func |    2 | Using index; Using where     |
|  2 | DERIVED            | UserMsg    | range          | CreateTime      | CreateTime | 9       | NULL | 3509 | Using where; Using temporary |
+----+--------------------+------------+----------------+-----------------+------------+---------+------+------+------------------------------+
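The costs of the two are practically the same. The cost can be read off the rows column (roughly the product of the rows values of all the lines, in other words the Cartesian product).
However, consider this one: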
explain select a.ruid,b.ruid
from (select distinct RUID
from UserMsg
where CreateTime >= '2009-8-14 15:30:00'
and CreateTime<='2009-8-17 16:00:00'
) a left join (
select distinct RUID
from UserMsg
where createTime < '2009-8-14 15:30:00'
) b on a.ruid = b.ruid
where b.ruid is null;
Execution time: 21.31s
+----+-------------+------------+-------+---------------+------------+---------+------+-------+------------------------------+
| id | select_type | table      | type  | possible_keys | key        | key_len | ref  | rows  | Extra                        |
+----+-------------+------------+-------+---------------+------------+---------+------+-------+------------------------------+
|  1 | PRIMARY     | <derived2> | ALL   | NULL          | NULL       | NULL    | NULL |  1248 |                              |
|  1 | PRIMARY     | <derived3> | ALL   | NULL          | NULL       | NULL    | NULL | 30308 | Using where; Not exists      |
|  3 | DERIVED     | UserMsg    | ALL   | CreateTime    | NULL       | NULL    | NULL | 69366 | Using where; Using temporary |
|  2 | DERIVED     | UserMsg    | range | CreateTime    | CreateTime | 9       | NULL |  3510 | Using where; Using temporary |
+----+-------------+------------+-------+---------------+------------+---------+------+-------+------------------------------+
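To put numbers on the rows-product rule of thumb, here is a small sketch that multiplies the rows values of the three plans for the queries without the SubjectID condition. It takes the rule literally, and the plan labels are mine; MySQL's real cost model is more involved, so treat the result as an indication of direction only.

```python
from math import prod

# "rows" values copied from the three EXPLAIN outputs of the queries
# without the SubjectID condition.
plans = {
    "left join on base table": [1248, 2, 3553],
    "not in": [1248, 2, 3509],
    "left join on derived table": [1248, 30308, 69366, 3510],
}

# Rule of thumb from the text: cost is roughly the product of the rows values.
# This is an illustration only, not MySQL's actual cost computation.
costs = {name: prod(rows) for name, rows in plans.items()}

for name, cost in costs.items():
    print(f"{name:28s} {cost:>22,d}")

ratio = costs["left join on derived table"] / costs["not in"]
print(f"derived-table plan / NOT IN plan: about {ratio:.1e}x")
```

The first two products differ by about one percent, while the derived-table plan comes out many orders of magnitude larger. The measured gap (21.31s against 0.20s) is far smaller than that, so the rule points the right way but should not be read as a prediction of run time.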
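What I do not quite understand about this last plan is why it has four rows, and why the two middle ones are so huge. In principle the query planner should be able to optimize this query into the same plan as the previous two (at least in PostgreSQL, the database I know well, I am confident it would). MySQL does not, so my feeling is that its query planner is still a bit crude.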
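As I said earlier, the basic principle of optimization is to make the Cartesian product happen between sets that are as small as possible, and the last form at least does not violate that principle. So many rows of table b match its condition that an index would basically not be used for that filter, but this should not stop the query optimizer from seeing the outer join ON condition and, as with the previous two queries, joining through the index on the join key (RUID in the plans above).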
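I described the role of the query planner above: in theory, walking through all the possibilities and computing the cost of each is the reasonable approach. My feeling is that for the last form the planner did not enumerate all the possibilities. Perhaps the reason is that the subquery implementation is still fairly simple?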
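The enumerate-and-cost idea can be sketched as a toy in a few lines. Everything here is hypothetical: the access-path names are invented labels, the row estimates are borrowed from the EXPLAIN outputs above, and the cost function is just the rows product. It is not how MySQL's planner is implemented.

```python
from itertools import product

# Hypothetical access paths per table, with estimated rows read per probe.
# The numbers are borrowed from the EXPLAIN outputs in this article.
access_paths = {
    "a": {"scan of DISTINCT result": 1248},
    "b": {"index lookup on RUID": 2, "full scan of derived table": 30308},
}

def enumerate_plans(paths):
    """Yield (cost, plan) for every combination of one access path per table."""
    tables = list(paths)
    for combo in product(*(paths[t].items() for t in tables)):
        plan = tuple(f"{t}: {name}" for t, (name, _) in zip(tables, combo))
        cost = 1
        for _, rows in combo:
            cost *= rows  # toy cost model: product of the row estimates
        yield cost, plan

# List every candidate plan with its cost, cheapest first.
for cost, plan in sorted(enumerate_plans(access_paths)):
    print(f"{cost:>12,d}  {plan}")

# The planner's job in this toy: pick the cheapest combination.
best_cost, best_plan = min(enumerate_plans(access_paths))
print("chosen:", best_plan)
```

In this toy the index lookup wins easily. The complaint in the text is that for the derived-table form MySQL apparently never puts that combination on the candidate list.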
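Subqueries really are a challenge for a database, because they are essentially recursive, so it is not surprising that there are some defects in this area.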
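If you think about it, the last form is just a variant of the first query in this summary; what matters is where the WHERE condition on table b is placed. Put inside the derived table, the join does not use the index; put outside, in the ON clause, it does. That choice is itself just one of the combinations the planner could enumerate.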
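As a minimal check that the two placements are logically the same query, the sketch below runs both on a few made-up rows. It assumes SQLite through Python's sqlite3 module, not MySQL, with invented sample data and zero-padded dates because SQLite compares them as text. It compares results only and does not reproduce the MySQL plans or timings discussed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE UserMsg (RUID TEXT, CreateTime TEXT)")
conn.execute("CREATE INDEX RUID ON UserMsg (RUID)")
conn.executemany(
    "INSERT INTO UserMsg VALUES (?, ?)",
    [
        ("u1", "2009-08-10 09:00:00"),  # before the window
        ("u1", "2009-08-15 10:00:00"),  # inside the window
        ("u2", "2009-08-15 11:00:00"),  # inside the window only
        ("u3", "2009-08-16 12:00:00"),  # inside the window only
        ("u4", "2009-08-01 08:00:00"),  # before the window only
    ],
)
params = {"start": "2009-08-14 15:30:00", "end": "2009-08-17 16:00:00"}

# Condition on b placed outside, in the ON clause of a join to the base table.
condition_outside = """
    SELECT a.RUID
    FROM (SELECT DISTINCT RUID FROM UserMsg
          WHERE CreateTime >= :start AND CreateTime <= :end) a
    LEFT JOIN UserMsg b
      ON a.RUID = b.RUID AND b.CreateTime < :start
    WHERE b.RUID IS NULL"""

# Condition on b placed inside a derived table.
condition_inside = """
    SELECT a.RUID
    FROM (SELECT DISTINCT RUID FROM UserMsg
          WHERE CreateTime >= :start AND CreateTime <= :end) a
    LEFT JOIN (SELECT DISTINCT RUID FROM UserMsg
               WHERE CreateTime < :start) b
      ON a.RUID = b.RUID
    WHERE b.RUID IS NULL"""

outside = sorted(row[0] for row in conn.execute(condition_outside, params))
inside = sorted(row[0] for row in conn.execute(condition_inside, params))
print(outside, inside)
```

Since both forms return the same rows, the difference between 0.13s and 21.31s comes from the plan MySQL chose, not from the logic of the query.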