Preface: these notes follow the book 《MySQL必知必會》 (MySQL Crash Course).
14.1 Full-Text Search
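To understand full-text search, you first need to understand storage engines: when a table is created, an ENGINE value at the end of the statement specifies the engine type. Three common engine types are:
- InnoDB is a reliable transaction-safe engine. It did not support full-text search before MySQL 5.6; from MySQL 5.6 onward, full-text search can also be used on InnoDB tables.
- MEMORY is functionally equivalent to MyISAM, but because the data is stored in memory (not on disk) it is very fast (and especially suited to temporary tables).
- MyISAM is a very high-performance engine. It supports full-text search but does not support transactions.
As you can see, not every engine supports full-text search. In the MySQL version the book covers, using full-text search therefore requires ENGINE=MyISAM.
Note: from MySQL 5.6 onward full-text search also works on InnoDB tables, but these notes follow 《MySQL必知必會》, so the examples use MyISAM.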
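To see which case applies on a particular server, the version and the available engines can be checked directly. A minimal sketch using only standard MySQL statements; the last query assumes a database has already been selected, so that DATABASE() returns its name:

```sql
# Server version (InnoDB full-text support needs 5.6 or later)
SELECT VERSION();

# Engines this server offers, and which one is the default
SHOW ENGINES;

# Engine used by each table in the currently selected database
SELECT TABLE_NAME, ENGINE
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE();
```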
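Earlier chapters covered a few more advanced ways to search: the LIKE keyword, which matches text using wildcards, and regular expressions, which allow more complex matching patterns.
These search mechanisms have several important limitations:
- Performance: wildcard and regular expression matching usually requires MySQL to try to match every row in the table (and these searches rarely use table indexes). As the number of rows to search grows, these searches can become very time-consuming.
- Explicit control: with wildcard and regular expression matching it is very difficult (and not always possible) to control explicitly what is and is not matched. For example: specifying that one word must match, that another word must not match, and that a third word may or may not match only if the first word did match.
- Intelligent results: although wildcard and regular expression searches are very flexible, neither offers an intelligent way of selecting results. For example, a search for a specific word returns every row that contains the word, without distinguishing rows with a single match from rows with multiple matches. Similarly, a search for a specific word does not find rows that lack that word but contain other related words.
Full-text search addresses these limitations and others. When full-text search is used, MySQL does not need to look at each row individually or analyze and process each word individually. MySQL builds an index of the words in the specified columns, and searches run against those words. This way MySQL can quickly and efficiently determine which words match, which do not, and so on.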
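14.1.1 Enabling Full-Text Search
To perform full-text searches, the columns to be searched must be indexed, and re-indexed continually as the data changes. This has to be set up when the table is designed; after that, MySQL performs all indexing and re-indexing automatically.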
#######################################
# Purpose: stores notes associated with specific products
# (not every product has notes)
# note_id    unique note ID
# prod_id    product ID (corresponds to prod_id in the products table)
# note_date  date the note was added
# note_text  note text
#######################################
CREATE TABLE productnotes
(
note_id int NOT NULL AUTO_INCREMENT,
prod_id char(10) NOT NULL,
note_date datetime NOT NULL,
note_text text NULL ,
PRIMARY KEY(note_id),
FULLTEXT(note_text)
) ENGINE=MyISAM;
Then insert the data:
# productnotes
INSERT INTO productnotes(note_id, prod_id, note_date, note_text)
VALUES(101, 'TNT2', '2005-08-17',
'Customer complaint:
Sticks not individually wrapped, too easy to mistakenly detonate all at once.
Recommend individual wrapping.'
);
INSERT INTO productnotes(note_id, prod_id, note_date, note_text)
VALUES(102, 'OL1', '2005-08-18',
'Can shipped full, refills not available.
Need to order new can if refill needed.'
);
INSERT INTO productnotes(note_id, prod_id, note_date, note_text)
VALUES(103, 'SAFE', '2005-08-18',
'Safe is combination locked, combination not provided with safe.
This is rarely a problem as safes are typically blown up or dropped by customers.'
);
INSERT INTO productnotes(note_id, prod_id, note_date, note_text)
VALUES(104, 'FC', '2005-08-19',
'Quantity varies, sold by the sack load.
All guaranteed to be bright and orange, and suitable for use as rabbit bait.'
);
INSERT INTO productnotes(note_id, prod_id, note_date, note_text)
VALUES(105, 'TNT2', '2005-08-20',
'Included fuses are short and have been known to detonate too quickly for some customers.
Longer fuses are available (item FU1) and should be recommended.'
);
INSERT INTO productnotes(note_id, prod_id, note_date, note_text)
VALUES(106, 'TNT2', '2005-08-22',
'Matches not included, recommend purchase of matches or detonator (item DTNTR).'
);
INSERT INTO productnotes(note_id, prod_id, note_date, note_text)
VALUES(107, 'SAFE', '2005-08-23',
'Please note that no returns will be accepted if safe opened using explosives.'
);
INSERT INTO productnotes(note_id, prod_id, note_date, note_text)
VALUES(108, 'ANV01', '2005-08-25',
'Multiple customer returns, anvils failing to drop fast enough or falling backwards on purchaser. Recommend that customer considers using heavier anvils.'
);
INSERT INTO productnotes(note_id, prod_id, note_date, note_text)
VALUES(109, 'ANV03', '2005-09-01',
'Item is extremely heavy. Designed for dropping, not recommended for use with slings, ropes, pulleys, or tightropes.'
);
INSERT INTO productnotes(note_id, prod_id, note_date, note_text)
VALUES(110, 'FC', '2005-09-01',
'Customer complaint: rabbit has been able to detect trap, food apparently less effective now.'
);
INSERT INTO productnotes(note_id, prod_id, note_date, note_text)
VALUES(111, 'SLING', '2005-09-02',
'Shipped unassembled, requires common tools (including oversized hammer).'
);
INSERT INTO productnotes(note_id, prod_id, note_date, note_text)
VALUES(112, 'SAFE', '2005-09-02',
'Customer complaint:
Circular hole in safe floor can apparently be easily cut with handsaw.'
);
INSERT INTO productnotes(note_id, prod_id, note_date, note_text)
VALUES(113, 'ANV01', '2005-09-05',
'Customer complaint:
Not heavy enough to generate flying stars around head of victim. If being purchased for dropping, recommend ANV02 or ANV03 instead.'
);
INSERT INTO productnotes(note_id, prod_id, note_date, note_text)
VALUES(114, 'SAFE', '2005-09-07',
'Call from individual trapped in safe plummeting to the ground, suggests an escape hatch be added.
Comment forwarded to vendor.'
);
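Note the FULLTEXT(note_text) and ENGINE=MyISAM clauses above.
- FULLTEXT(): lists the column(s) to index; more than one column can be specified.
- ENGINE=MyISAM: specifies the MyISAM engine type.
So in the CREATE TABLE statement above, FULLTEXT(note_text) makes note_text the indexed column, that is, the column that full-text searches run against.
Once the index is defined, MySQL maintains it automatically: it is updated whenever rows are inserted, updated, or deleted.
FULLTEXT can also be added after the table has been created, using ALTER TABLE.
Note: do not use FULLTEXT while importing data, because updating the index takes time. If you are importing data into a new table, do not enable the FULLTEXT index at that point. Import all the data first, then alter the table to define FULLTEXT. This makes the import faster, and the total time spent indexing is less than the total time needed to index each row separately as it is imported.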
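The load-first, index-later approach described in the note might look like the sketch below. The table name productnotes_import is hypothetical, chosen so that it does not clash with the productnotes table above; the single row is copied from the sample data and stands in for a full bulk load.

```sql
# Hypothetical staging table: same columns, but no FULLTEXT index yet
CREATE TABLE productnotes_import
(
  note_id   int      NOT NULL AUTO_INCREMENT,
  prod_id   char(10) NOT NULL,
  note_date datetime NOT NULL,
  note_text text     NULL,
  PRIMARY KEY(note_id)
) ENGINE=MyISAM;

# Load all rows here (one sample row shown)
INSERT INTO productnotes_import(note_id, prod_id, note_date, note_text)
VALUES(106, 'TNT2', '2005-08-22',
'Matches not included, recommend purchase of matches or detonator (item DTNTR).');

# Build the full-text index once, after the data is in place
ALTER TABLE productnotes_import ADD FULLTEXT(note_text);
```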
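14.1.2 Performing Full-Text Searches
After indexing, full-text searches are performed with two functions, MATCH() and AGAINST(): MATCH() specifies the columns to search and AGAINST() specifies the search expression to use.
For example: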
SELECT note_text
FROM productnotes
WHERE MATCH(note_text) AGAINST('rabbit');
Output:
+---------------------------------------------------------------------------------------------------------------------+
| note_text |
+---------------------------------------------------------------------------------------------------------------------+
| Customer complaint: rabbit has been able to detect trap, food apparently less effective now. |
| Quantity varies, sold by the sack load. All guaranteed to be bright and orange, and suitable for use as rabbit bait. |
+---------------------------------------------------------------------------------------------------------------------+
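The value passed to MATCH() must be the same as in the FULLTEXT() definition. If multiple columns are specified, all of them must be listed (and in the correct order).
Searches are case-insensitive unless BINARY mode is used.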
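As a sketch of the multi-column case, the hypothetical table below (articles, with title and body columns, none of which appear in the book's sample data) defines one FULLTEXT index over two columns, so MATCH() names the same two columns in the same order:

```sql
# Hypothetical table, for illustration only
CREATE TABLE articles
(
  article_id int          NOT NULL AUTO_INCREMENT,
  title      varchar(200) NOT NULL,
  body       text         NULL,
  PRIMARY KEY(article_id),
  FULLTEXT(title, body)
) ENGINE=MyISAM;

# The column list mirrors the FULLTEXT(title, body) definition
SELECT title
FROM articles
WHERE MATCH(title, body) AGAINST('rabbit');
```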
The rabbit search on productnotes could also be written with a LIKE clause:
SELECT note_text
FROM productnotes
WHERE note_text LIKE '%rabbit%';
Output:
+---------------------------------------------------------------------------------------------------------------------+
| note_text |
+---------------------------------------------------------------------------------------------------------------------+
| Quantity varies, sold by the sack load. All guaranteed to be bright and orange, and suitable for use as rabbit bait. |
| Customer complaint: rabbit has been able to detect trap, food apparently less effective now. |
+---------------------------------------------------------------------------------------------------------------------+
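One way to see the performance difference between the two approaches is to ask MySQL how it plans to run each query. A minimal sketch using EXPLAIN on the two statements above; the exact plan depends on the server version and the data, but a LIKE pattern with a leading wildcard typically shows a full table scan (type ALL), while the MATCH() query typically shows the fulltext access type:

```sql
# Plan for the wildcard search
EXPLAIN SELECT note_text FROM productnotes WHERE note_text LIKE '%rabbit%';

# Plan for the full-text search
EXPLAIN SELECT note_text FROM productnotes WHERE MATCH(note_text) AGAINST('rabbit');
```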
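Neither of the two SELECT statements above contains an ORDER BY clause. The LIKE version returns the rows in no particularly useful order, whereas the full-text search returns them sorted by how well the text matches. An important part of full-text search is ranking the results: rows with a higher rank are returned first. (In the example above both rows contain the word rabbit, but the row in which rabbit is the 3rd word ranks higher than the row in which it is the 20th word.)
The ranking for the word rabbit can be displayed like this: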
SELECT note_text,
MATCH(note_text) AGAINST('rabbit') AS `rank`  # backticks: RANK is a reserved word in MySQL 8.0
FROM productnotes;
Output:
+----------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
| note_text | rank |
+----------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
| Customer complaint:Sticks not individually wrapped, too easy to mistakenly detonate all at once.Recommend individual wrapping. | 0 |
| Can shipped full, refills not available.Need to order new can if refill needed. | 0 |
| Safe is combination locked, combination not provided with safe.This is rarely a problem as safes are typically blown up or dropped by customers. | 0 |
| Quantity varies, sold by the sack load.All guaranteed to be bright and orange, and suitable for use as rabbit bait. | 1.5905543565750122 |
| Included fuses are short and have been known to detonate too quickly for some customers.Longer fuses are available (item FU1) and should be recommended. | 0 |
| Matches not included, recommend purchase of matches or detonator (item DTNTR). | 0 |
| Please note that no returns will be accepted if safe opened using explosives. | 0 |
| Multiple customer returns, anvils failing to drop fast enough or falling backwards on purchaser. Recommend that customer considers using heavier anvils. | 0 |
| Item is extremely heavy. Designed for dropping, not recommended for use with slings, ropes, pulleys, or tightropes. | 0 |
| Customer complaint: rabbit has been able to detect trap, food apparently less effective now. | 1.6408053636550903 |
| Shipped unassembled, requires common tools (including oversized hammer). | 0 |
| Customer complaint:Circular hole in safe floor can apparently be easily cut with handsaw. | 0 |
| Customer complaint:Not heavy enough to generate flying stars around head of victim. If being purchased for dropping, recommend ANV02 or ANV03 instead. | 0 |
| Call from individual trapped in safe plummeting to the ground, suggests an escape hatch be added.Comment forwarded to vendor. | 0 |
+----------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
14 rows in set (0.09 sec)
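The rank column shows the ranking value computed by the full-text search. MySQL calculates the rank from the number of words in the row, the number of unique words, the total number of words in the whole index, and the number of rows that contain the word. In the output above, rows that do not contain rabbit have a rank of 0, while the two rows that do contain it each have a rank value, and the row in which the word appears earlier in the text has the higher of the two.
When multiple search terms are specified, rows that contain the most matching words rank higher than rows that contain fewer of them.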
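To observe this on the sample data, the rank query can be repeated with two search terms. A sketch, assuming the productnotes table above; rabbit and bait are simply two words known to occur in the notes, and the rank values themselves will vary with the data and server settings:

```sql
# Only matching rows are returned; compare their rank values
SELECT note_text,
       MATCH(note_text) AGAINST('rabbit bait') AS `rank`
FROM productnotes
WHERE MATCH(note_text) AGAINST('rabbit bait');
```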
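14.1.3 Using Query Expansion
Query expansion widens the range of full-text search results returned. For example, suppose you want to find the notes that mention anvils. Only one note contains the word anvils, but you may also want other rows that are likely to be related to the search, even if they do not contain anvils.
That is what query expansion is for. With query expansion, MySQL makes two passes over the data and indexes to complete the search:
- First, a basic full-text search finds all rows that match the search criteria;
- Next, MySQL examines those matched rows and selects all useful words;
- Finally, MySQL runs the full-text search again, this time using not only the original criteria but also all of the useful words.
Query expansion was introduced in MySQL 4.1.1.
Here is an example. First, a simple full-text search without query expansion: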
SELECT note_text
FROM productnotes
WHERE MATCH(note_text) AGAINST('anvils');
Output:
+----------------------------------------------------------------------------------------------------------------------------------------------------------+
| note_text |
+----------------------------------------------------------------------------------------------------------------------------------------------------------+
| Multiple customer returns, anvils failing to drop fast enough or falling backwards on purchaser. Recommend that customer considers using heavier anvils. |
+----------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.04 sec)
Now with query expansion:
SELECT note_text
FROM productnotes
WHERE MATCH(note_text) AGAINST('anvils' WITH QUERY EXPANSION);
# Variant that also returns the computed rank for every row:
SELECT note_text, MATCH(note_text) AGAINST('anvils' WITH QUERY EXPANSION) AS `rank`
FROM productnotes;
Output (of the first query):
+----------------------------------------------------------------------------------------------------------------------------------------------------------+
| note_text |
+----------------------------------------------------------------------------------------------------------------------------------------------------------+
| Multiple customer returns, anvils failing to drop fast enough or falling backwards on purchaser. Recommend that customer considers using heavier anvils. |
| Please note that no returns will be accepted if safe opened using explosives. |
| Customer complaint:Sticks not individually wrapped, too easy to mistakenly detonate all at once.Recommend individual wrapping. |
+----------------------------------------------------------------------------------------------------------------------------------------------------------+
3 rows in set (0.04 sec)
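Explanation: query expansion is requested with the WITH QUERY EXPANSION keywords inside AGAINST(). This time 3 rows are returned. The first row contains the word anvils, so it ranks highest. The second row has nothing to do with anvils, but it contains two words that also appear in the first row, returns and using. The third row contains the words Customer and Recommend, but the two words are far apart, so the row is sorted last.
《MySQL必知必會》 returns 7 rows for this query, which puzzled me: there should indeed be 7 rows, 6 of them related rows. I left this as an open question.
Resolved: the cause was that I had set the collation to utf8_bin, which is case-sensitive.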
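The collation in effect can be checked, and changed, with standard statements. A sketch, assuming the table uses the utf8 character set as in the setup described above; utf8_general_ci is one case-insensitive choice, not the only one:

```sql
# The Collation column shows what each column currently uses
SHOW FULL COLUMNS FROM productnotes;

# Convert the table to a case-insensitive collation
ALTER TABLE productnotes CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
```

After the conversion, rerunning the query expansion search is a quick way to check whether the row count now agrees with the book.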
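14.1.4 Boolean Text Searches
MySQL supports another form of full-text search, called boolean mode. In boolean mode you can provide details about the following:
- Words to be matched;
- Words to be excluded (a row that contains such a word is not returned, even if it contains other specified words);
- Ranking hints (specifying which words are more important than others, so that they rank higher);
- Expression grouping;
Boolean mode can be used even without a FULLTEXT index (on MyISAM tables), but that is a very slow operation (and performance degrades as the amount of data grows).
Boolean mode uses the following boolean operators: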
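Boolean operator | Description |
---|---|
+ | Include: the word must be present |
- | Exclude: the word must not be present |
> | Include, and increase the rank value |
< | Include, and decrease the rank value |
() | Group words into subexpressions (so that a group as a whole can be included, excluded, ranked, and so on) |
~ | Negate a word's rank value |
* | Wildcard at the end of a word |
"" | Define a phrase (unlike a list of individual words, the whole phrase is matched for inclusion or exclusion) |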
Boolean mode also requires the IN BOOLEAN MODE keywords.
Some examples:
- Match rows that contain both of the words rabbit and bait.
SELECT note_text
FROM productnotes
WHERE MATCH(note_text) AGAINST('+rabbit +bait' IN BOOLEAN MODE);
Output:
+---------------------------------------------------------------------------------------------------------------------+
| note_text |
+---------------------------------------------------------------------------------------------------------------------+
| Quantity varies, sold by the sack load.All guaranteed to be bright and orange, and suitable for use as rabbit bait. |
+---------------------------------------------------------------------------------------------------------------------+
1 row in set (0.07 sec)
- With no operators specified, the search matches rows that contain at least one of the words rabbit and bait.
SELECT note_text
FROM productnotes
WHERE MATCH(note_text) AGAINST('rabbit bait' IN BOOLEAN MODE);
Output:
+---------------------------------------------------------------------------------------------------------------------+
| note_text |
+---------------------------------------------------------------------------------------------------------------------+
| Quantity varies, sold by the sack load.All guaranteed to be bright and orange, and suitable for use as rabbit bait. |
| Customer complaint: rabbit has been able to detect trap, food apparently less effective now. |
+---------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.06 sec)
- Match the phrase rabbit bait rather than the two separate words rabbit and bait.
SELECT note_text
FROM productnotes
WHERE MATCH(note_text) AGAINST('"rabbit bait"' IN BOOLEAN MODE);
Output:
+---------------------------------------------------------------------------------------------------------------------+
| note_text |
+---------------------------------------------------------------------------------------------------------------------+
| Quantity varies, sold by the sack load.All guaranteed to be bright and orange, and suitable for use as rabbit bait. |
+---------------------------------------------------------------------------------------------------------------------+
1 row in set (0.08 sec)
- Match rabbit and carrot, increasing the rank of the former and decreasing the rank of the latter.
SELECT note_text
FROM productnotes
WHERE MATCH(note_text) AGAINST('>rabbit <carrot' IN BOOLEAN MODE);
Output:
+---------------------------------------------------------------------------------------------------------------------+
| note_text |
+---------------------------------------------------------------------------------------------------------------------+
| Quantity varies, sold by the sack load.All guaranteed to be bright and orange, and suitable for use as rabbit bait. |
| Customer complaint: rabbit has been able to detect trap, food apparently less effective now. |
+---------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.06 sec)
- Match the words safe and combination, decreasing the rank of the latter.
SELECT note_text
FROM productnotes
WHERE MATCH(note_text) AGAINST('+safe +(<combination)' IN BOOLEAN MODE);
Output:
+--------------------------------------------------------------------------------------------------------------------------------------------------+
| note_text |
+--------------------------------------------------------------------------------------------------------------------------------------------------+
| Safe is combination locked, combination not provided with safe.This is rarely a problem as safes are typically blown up or dropped by customers. |
+--------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.08 sec)
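The examples above do not exercise the - and * operators from the table. A sketch combining the two, assuming the same productnotes data: it asks for rows that contain heavy but no word beginning with rope.

```sql
# heavy must be matched; any word starting with rope excludes the row
SELECT note_text
FROM productnotes
WHERE MATCH(note_text) AGAINST('heavy -rope*' IN BOOLEAN MODE);
```

On the sample data this should drop the note that mentions ropes while keeping other notes that contain heavy.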
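14.1.5 Summary
- When full-text data is indexed, short words are ignored and excluded from the index. Short words are defined as words of 3 or fewer characters (this number can be changed if needed).
- MySQL ships with a built-in list of stopwords, which are always ignored when full-text data is indexed. The list can be overridden if needed (see the MySQL documentation).
- Many words occur so frequently that searching for them is useless (too many results would be returned). MySQL therefore applies a 50% rule: a word that appears in more than 50% of the rows is treated as a stopword and ignored. The 50% rule does not apply IN BOOLEAN MODE.
- Single quotes inside words are ignored; for example, don't is indexed as dont.
- Languages without word delimiters (including Japanese and Chinese) do not return proper full-text search results.
- Full-text search requires the MyISAM engine; from MySQL 5.6 onward it can also be used on InnoDB tables.
- Full-text indexes can be created only on char, varchar, and text columns.
- Define which column(s) are FULLTEXT only after the data has been imported; otherwise the import is very time-consuming.
Note: this describes full-text search as of MySQL 5.0.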
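The word-length and stopword limits mentioned in the summary are controlled by server variables, which can be inspected as sketched below. These are the MyISAM full-text variables; InnoDB has its own equivalents (such as innodb_ft_min_token_size). After changing any of them, existing FULLTEXT indexes need to be rebuilt before the new values take effect.

```sql
# Minimum and maximum indexed word length (MyISAM)
SHOW VARIABLES LIKE 'ft_min_word_len';
SHOW VARIABLES LIKE 'ft_max_word_len';

# Stopword file in use
SHOW VARIABLES LIKE 'ft_stopword_file';
```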