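Learning SQL (Full-Text Search)

Understanding Full-Text Search

Earlier we looked at searching with the LIKE keyword, which matches text using wildcard operators. With LIKE you can find rows that contain specific values or parts of values. With regular expressions you can write very complex matching patterns to find the rows you need.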

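These search mechanisms are very useful, but they have several important limitations:

  • Performance: wildcard and regular expression matching usually requires MySQL to try to match every row in the table (and these searches rarely use table indexes). As the number of rows to search grows, these searches can become very time-consuming.
  • Explicit control: with wildcard and regular expression matching it is hard (and not always possible) to control exactly what is and is not matched. For example, requiring that one word must match, that another word must not match, and that a third word may or may not match only if the first word does match.
  • Intelligent results: although wildcard and regular expression searches are very flexible, neither offers an intelligent way of selecting results. For example, a search for a specific word returns every row containing that word, without distinguishing rows with a single match from rows with several matches.

All of these limitations, and more, can be addressed with full-text search. When full-text search is used, MySQL does not need to look at each row individually or analyze and process each word individually. MySQL builds an index of the words in the specified columns, and searches run against those words. This lets MySQL determine quickly and efficiently which words match, which do not, how often they occur, and so on.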

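Using Full-Text Search

To perform full-text searches, the columns to be searched must be indexed, and they must be re-indexed continually as the data changes. Once the table is designed appropriately, MySQL does all the indexing and re-indexing automatically.

After indexing, SELECT is used together with Match() and Against() to actually perform the search.

Enabling Full-Text Search Support

Full-text search support is usually enabled when the table is created. The CREATE TABLE statement accepts a FULLTEXT clause, which gives a comma-separated list of the columns to index.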
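As a rough sketch, a definition for the productnotes table used in the queries in this post could look like the following. Only the table name and the note_text column come from the queries shown here; the other columns, their types, and the primary key are assumptions made for illustration.

```sql
-- Sketch only: every column except note_text is assumed, not taken from the original table.
CREATE TABLE productnotes
(
  note_id   INT      NOT NULL AUTO_INCREMENT,
  prod_id   CHAR(10) NOT NULL,
  note_date DATETIME NOT NULL,
  note_text TEXT     NULL,
  PRIMARY KEY (note_id),
  FULLTEXT (note_text)   -- tells MySQL to build and maintain a full-text index on note_text
) ENGINE=MyISAM;

-- For a table that already exists, the index can be added afterwards instead:
-- ALTER TABLE productnotes ADD FULLTEXT (note_text);
```

Once the index exists, a search is written with Match() naming the indexed column and Against() giving the search text, as in the query that follows.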

mysql> SELECT note_text FROM productnotes WHERE Match(note_text) Against ('rabbit');
+----------------------------------------------------------------------------------------------------------------------+
| note_text                                                                                                            |
+----------------------------------------------------------------------------------------------------------------------+
| Customer complaint: rabbit has been able to detect trap, food apparently less effective now.                         |
| Quantity varies, sold by the sack load.
All guaranteed to be bright and orange, and suitable for use as rabbit bait. |
+----------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

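! The value passed to Match() must be the same as in the FULLTEXT() definition. If multiple columns are specified, all of them must be listed.
! The search is case-insensitive unless BINARY mode is used.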
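To illustrate the first note, here is a small sketch with a multi-column index. The articles table and its columns are hypothetical and exist only for this example; the point is that the column list in Match() repeats the column list in FULLTEXT().

```sql
-- Hypothetical table, used only to show a multi-column FULLTEXT index.
CREATE TABLE articles
(
  id    INT          NOT NULL AUTO_INCREMENT,
  title VARCHAR(200) NOT NULL,
  body  TEXT         NULL,
  PRIMARY KEY (id),
  FULLTEXT (title, body)
) ENGINE=MyISAM;

-- Match() lists the same columns as the FULLTEXT() definition above.
SELECT id, title
FROM articles
WHERE Match(title, body) Against('rabbit');
```

For comparison, the same rabbit search on productnotes written with LIKE looks like this: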

mysql> SELECT note_text FROM productnotes WHERE note_text LIKE "%rabbit%";
+----------------------------------------------------------------------------------------------------------------------+
| note_text                                                                                                            |
+----------------------------------------------------------------------------------------------------------------------+
| Quantity varies, sold by the sack load.
All guaranteed to be bright and orange, and suitable for use as rabbit bait. |
| Customer complaint: rabbit has been able to detect trap, food apparently less effective now.                         |
+----------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

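Full-text search returns data sorted by how well the text matches. Both rows contain the word rabbit, but the row in which rabbit is the 3rd word ranks higher than the row in which it is the 20th word. The LIKE query returns the same two rows, but in no particularly useful order.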
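One way to see the ranking at work is to select the Match() ... Against() expression itself as a column. This is a sketch against the same productnotes table; the alias relevance is my own choice (rank is a reserved word in MySQL 8.0, so it is best avoided as an alias).

```sql
-- With no WHERE clause every row is returned.
-- Rows that do not contain 'rabbit' get a relevance of 0;
-- rows that do contain it get a positive value, higher for better matches.
SELECT note_text,
       Match(note_text) Against('rabbit') AS relevance
FROM productnotes;
```

The exact values depend on the data (number of words in the row, number of unique words, how many rows contain the word, and so on), so no specific numbers are shown here.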

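Using Query Expansion

Query expansion is used to try to widen the range of full-text search results that are returned.

Use case: you want to find every note that mentions anvils. Only one note contains the word anvils, but you also want to find all other rows that may be related to your search, even if they do not contain the word anvils.

  • First, MySQL performs a basic full-text search to find all rows that match the search criteria.
  • Next, MySQL examines those matching rows and selects all the useful words.
  • Finally, MySQL performs the full-text search again, this time using not only the original criteria but also all the useful words.

With query expansion you can find results that may be relevant even if they do not contain the exact words you searched for.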

mysql> SELECT note_text FROM productnotes WHERE Match(note_text) Against ('anvils');
+----------------------------------------------------------------------------------------------------------------------------------------------------------+
| note_text                                                                                                                                                |
+----------------------------------------------------------------------------------------------------------------------------------------------------------+
| Multiple customer returns, anvils failing to drop fast enough or falling backwards on purchaser. Recommend that customer considers using heavier anvils. |
+----------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.01 sec)

mysql> SELECT note_text FROM productnotes WHERE Match(note_text) Against ('anvils' WITH QUERY EXPANSION);
+----------------------------------------------------------------------------------------------------------------------------------------------------------+
| note_text                                                                                                                                                |
+----------------------------------------------------------------------------------------------------------------------------------------------------------+
| Multiple customer returns, anvils failing to drop fast enough or falling backwards on purchaser. Recommend that customer considers using heavier anvils. |
| Customer complaint:
Sticks not individually wrapped, too easy to mistakenly detonate all at once.
Recommend individual wrapping.                         |
| Customer complaint:
Not heavy enough to generate flying stars around head of victim. If being purchased for dropping, recommend ANV02 or ANV03 instead.  |
| Please note that no returns will be accepted if safe opened using explosives.                                                                            |
| Customer complaint: rabbit has been able to detect trap, food apparently less effective now.                                                             |
| Customer complaint:
Circular hole in safe floor can apparently be easily cut with handsaw.                                                               |
| Matches not included, recommend purchase of matches or detonator (item DTNTR).                                                                           |
+----------------------------------------------------------------------------------------------------------------------------------------------------------+
7 rows in set (0.01 sec)

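With query expansion, the first row contains the word anvils and therefore ranks highest. The second row has nothing to do with anvils, but it is retrieved because it contains two words that also appear in the first row (customer and recommend). The third row contains the same two words, but they appear later in the text and further apart, so this row is included as well but ranks third.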

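Boolean Text Search

MySQL supports another form of full-text search, called Boolean mode.
In Boolean mode you can specify details such as:

  • Words to be matched
  • Words to be excluded
  • Ranking hints
  • Expression grouping
  • And more

! Boolean mode can be used even without a FULLTEXT index, but this is a very slow operation (its performance degrades as the amount of data grows).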

1. Retrieve all rows containing the word heavy

mysql> SELECT note_text FROM productnotes WHERE Match(note_text) Against ('heavy' IN BOOLEAN MODE);
+---------------------------------------------------------------------------------------------------------------------------------------------------------+
| note_text                                                                                                                                               |
+---------------------------------------------------------------------------------------------------------------------------------------------------------+
| Item is extremely heavy. Designed for dropping, not recommended for use with slings, ropes, pulleys, or tightropes.                                     |
| Customer complaint:
Not heavy enough to generate flying stars around head of victim. If being purchased for dropping, recommend ANV02 or ANV03 instead. |
+---------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.01 sec)

2. Retrieve rows that contain heavy but no word beginning with rope

mysql> SELECT note_text FROM productnotes WHERE Match(note_text) Against ('heavy -rope*' IN BOOLEAN MODE);
+---------------------------------------------------------------------------------------------------------------------------------------------------------+
| note_text                                                                                                                                               |
+---------------------------------------------------------------------------------------------------------------------------------------------------------+
| Customer complaint:
Not heavy enough to generate flying stars around head of victim. If being purchased for dropping, recommend ANV02 or ANV03 instead. |
+---------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.01 sec)

3. Retrieve rows containing both the word rabbit and the word bait

mysql> SELECT note_text FROM productnotes WHERE Match(note_text) Against ('+rabbit +bait' IN BOOLEAN MODE);
+----------------------------------------------------------------------------------------------------------------------+
| note_text                                                                                                            |
+----------------------------------------------------------------------------------------------------------------------+
| Quantity varies, sold by the sack load.
All guaranteed to be bright and orange, and suitable for use as rabbit bait. |
+----------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

4. Retrieve rows containing at least one of the words rabbit and bait

mysql> SELECT note_text FROM productnotes WHERE Match(note_text) Against ('rabbit bait' IN BOOLEAN MODE);
+----------------------------------------------------------------------------------------------------------------------+
| note_text                                                                                                            |
+----------------------------------------------------------------------------------------------------------------------+
| Quantity varies, sold by the sack load.
All guaranteed to be bright and orange, and suitable for use as rabbit bait. |
| Customer complaint: rabbit has been able to detect trap, food apparently less effective now.                         |
+----------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

5. Retrieve rows containing the phrase rabbit bait (the double quotes go inside the string literal; if they are used only as the string delimiters, the query behaves like example 4)

mysql> SELECT note_text FROM productnotes WHERE Match(note_text) Against ('"rabbit bait"' IN BOOLEAN MODE);
+----------------------------------------------------------------------------------------------------------------------+
| note_text                                                                                                            |
+----------------------------------------------------------------------------------------------------------------------+
| Quantity varies, sold by the sack load.
All guaranteed to be bright and orange, and suitable for use as rabbit bait. |
+----------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

6. Match rabbit and carrot, raising the rank of the former and lowering the rank of the latter

mysql> SELECT note_text FROM productnotes WHERE Match(note_text) Against (">rabbit <carrot" IN BOOLEAN MODE);
+----------------------------------------------------------------------------------------------------------------------+
| note_text                                                                                                            |
+----------------------------------------------------------------------------------------------------------------------+
| Quantity varies, sold by the sack load.
All guaranteed to be bright and orange, and suitable for use as rabbit bait. |
| Customer complaint: rabbit has been able to detect trap, food apparently less effective now.                         |
+----------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

7. Match safe and combination, lowering the rank of the latter

mysql> SELECT note_text FROM productnotes WHERE Match(note_text) Against ("+safe +(<combination)" IN BOOLEAN MODE);
+---------------------------------------------------------------------------------------------------------------------------------------------------+
| note_text                                                                                                                                         |
+---------------------------------------------------------------------------------------------------------------------------------------------------+
| Safe is combination locked, combination not provided with safe.
This is rarely a problem as safes are typically blown up or dropped by customers. |
+---------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

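Full-Text Boolean Operators

  • +   The word must be present
  • -   The word must not be present
  • >   Increase the word's contribution to the ranking value
  • <   Decrease the word's contribution to the ranking value
  • ()  Group words into a subexpression (so they can be included, excluded, or ranked as a group)
  • ~   Negate the word's contribution to the ranking value
  • *   Wildcard at the end of a word
  • ""  Define a phrase (matches the whole phrase rather than the individual words)

In Boolean mode, the returned rows are not sorted in descending order of rank.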
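If ranked output is wanted in Boolean mode, one option is to select the relevance value and sort on it explicitly. This is a minimal sketch reusing the rabbit/carrot search from example 6; the alias relevance is an assumption of mine, not something from the original queries.

```sql
-- Boolean mode does not sort by rank on its own,
-- so the relevance value is selected and used in ORDER BY.
SELECT note_text,
       Match(note_text) Against('>rabbit <carrot' IN BOOLEAN MODE) AS relevance
FROM productnotes
WHERE Match(note_text) Against('>rabbit <carrot' IN BOOLEAN MODE)
ORDER BY relevance DESC;
```

Repeating the same Match() ... Against() expression in the SELECT list and the WHERE clause is the usual way to both filter and expose the score.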

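Notes on Using Full-Text Search

  • When full-text data is indexed, short words are ignored and excluded from the index. Short words are defined as words of 3 or fewer characters (this number can be changed if needed).
  • MySQL ships with a built-in list of stopwords, which are always ignored when full-text data is indexed. The whole list can be overridden if needed.
  • Many words occur so frequently that searching for them is useless (too many results would be returned). MySQL therefore applies a rule: if a word appears in more than 50% of the rows, it is treated as a stopword and ignored. The 50% rule does not apply to IN BOOLEAN MODE.
  • If the table has fewer than 3 rows, full-text search returns no results (because every word either does not appear at all or appears in at least 50% of the rows).
  • Single quotes within words are ignored. For example, don't is indexed as dont.
  • Languages without word delimiters (including Japanese and Chinese) do not return full-text search results properly.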
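  • Full-text search was originally supported only by the MyISAM storage engine. Since MySQL 5.6, InnoDB supports FULLTEXT indexes as well.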
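Several of the settings mentioned in these notes can be inspected from the mysql prompt. The statements below are a sketch and assume the productnotes table from the earlier examples; the values returned depend on the server configuration, so no output is shown.

```sql
-- Minimum indexed word length for MyISAM full-text indexes
-- (words shorter than this are skipped).
SHOW VARIABLES LIKE 'ft_min_word_len';

-- The corresponding setting for InnoDB full-text indexes (MySQL 5.6 and later).
SHOW VARIABLES LIKE 'innodb_ft_min_token_size';

-- Shows the storage engine and any FULLTEXT index defined on the table.
SHOW CREATE TABLE productnotes;

-- After changing ft_min_word_len in the server configuration and restarting,
-- MyISAM full-text indexes have to be rebuilt before the new value takes effect:
-- REPAIR TABLE productnotes QUICK;
```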