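1 Preface
Oracle has supported full-text retrieval since release 7.3: users can run text-based queries through the ConText option of the Oracle server. Available techniques include wildcard search, fuzzy matching, relevance ranking, proximity search, term weighting, and word-meaning expansion. The feature was called ConText in Oracle 8.0.x, interMedia Text in Oracle8i, and Oracle Text in Oracle9i.
This article introduces the basic architecture of Oracle Text and some simple applications.
Oracle Text is part of both the Standard and Enterprise Editions of 9i. Oracle9i ships full-text retrieval as a built-in feature, so it is installed automatically when a database instance is created.
Oracle Text has many areas of application:
- Searching text: applications that need fast, effective search over text data
- Managing mixed documents: applications that search across a mix of document formats, including Word, Excel, and Lotus
- Retrieving text from multiple sources: not only text stored in an Oracle database, but also text from the Internet and from file systems
- Searching XML applications
1.1 Searching Text
Even without Oracle Text there are several ways to search text in an Oracle database, for example with the standard INSTR function or the LIKE operator.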
SELECT *
FROM mytext
WHERE INSTR (thetext, 'Oracle') > 0;
SELECT *
FROM mytext
WHERE thetext LIKE '%Oracle%';
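INSTR and LIKE are often perfectly adequate, especially when the search covers only a small table. However, locating text this way forces a full table scan, which is expensive in resources, and the search capabilities are very limited.
With Oracle Text you can answer requests such as "find text in which the word 'Oracle' and the word 'Corporation' occur in the same row no more than 10 words apart", "find text containing the word 'Oracle' or the word 'california' and rank the results by relevance", and "find text containing words with the stem train". The following SQL implements these three searches; ignore the details of the syntax for now.
DROP INDEX mytext_idx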
/
CREATE INDEX mytext_idx
ON mytext( thetext )
INDEXTYPE is CTXSYS.CONTEXT
/
SELECT id
FROM mytext
WHERE contains (thetext, 'near((Oracle,Corporation),10)') > 0
/
SELECT score (1), id
FROM mytext
WHERE contains (thetext, 'Oracle or california', 1) > 0
ORDER BY score (1) DESC
/
SELECT id
FROM mytext
WHERE contains (thetext, '$train') > 0;
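1.2 Setup
First check whether the database has the CTXSYS user and the CTXAPP role. If they are missing, the database was created without the interMedia Text feature, and you must modify the database to install it.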
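A quick way to run this check is to query the data dictionary. The sketch below assumes a connection with access to the DBA views (for example as SYSTEM); that connection is an assumption and is not specified in the text above.

```sql
-- Both queries need access to the DBA_* dictionary views.
-- Each should return exactly one row when the Text option is installed.
SELECT username FROM dba_users WHERE username = 'CTXSYS';
SELECT role     FROM dba_roles WHERE role     = 'CTXAPP';
```

If either query returns no rows, the user or role is missing and the feature still has to be installed.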
You can also check whether the listener on the server is serving the PLSExtProc service.
lsnrctl status
The output should look similar to the following:
LSNRCTL for Solaris: Version 8.1.5.0.0 - Production on 31-MAR-99 18:57:49
(c) Copyright 1998 Oracle Corporation. All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=EXTPROC0)))
STATUS of the LISTENER
------------------------
Alias LISTENER
Version TNSLSNR for Solaris: Version 8.1.5.0.0 - Production
Start Date 30-MAR-99 15:53:06
Uptime 1 days 3 hr. 4 min. 42 sec
Trace Level off
Security OFF
SNMP OFF
Listener Parameter File /private7/Oracle/Oracle_home/network/admin/listener.ora
Listener Log File /private7/Oracle/Oracle_home/network/log/listener.log
Services Summary...
PLSExtProc has 1 service handler(s)
oco815 has 3 service handler(s)
The command completed successfully
Oracle implements interMedia Text through what are called external procedures.
CREATE USER ctxtest IDENTIFIED BY ctxtest;
GRANT CONNECT, RESOURCE, ctxapp TO ctxtest;
CREATE TABLE quick (
quick_id NUMBER PRIMARY KEY,
text VARCHAR(80));
INSERT INTO quick
(quick_id, text)
VALUES (1, 'The cat sat on the mat');
INSERT INTO quick
(quick_id, text)
VALUES (2, 'The quick brown fox jumped over the lazy dog');
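CREATE INDEX quick_text ON quick ( text ) INDEXTYPE IS ctxsys.CONTEXT;
----------------------------------------------------------
If the listener is not configured correctly at this point, creating the index fails: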
CREATE INDEX quick_text ON quick ( text );
*
ERROR at line 1:
ORA-29855: error occurred in the execution of ODCIINDEXCREATE routine
ORA-20000: ConText error:
DRG-50704: Net8 listener is not running or cannot start external procedures
ORA-28575: unable to open RPC connection to external procedure agent
ORA-06512: at "CTXSYS.DRUE", line 122
ORA-06512: at "CTXSYS.TEXTINDEXMETHODS", line 34
ORA-06512: at line 1
With the listener configured correctly, the same statement succeeds:
Index created.
SQL> SELECT quick_id FROM quick WHERE contains (text, 'cat') > 0;
QUICK_ID
----------
1
SQL> SELECT quick_id FROM quick WHERE contains(text, 'fox') > 0;
QUICK_ID
----------
2
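When you use Oracle Text to search a document collection, you must first build an index on the text column. The index breaks the text into tokens, which are usually individual words separated by spaces.
Implementing an Oracle Text application is essentially a three-step process: load the data, index the data, then run queries.
2.1.1 Index Types and Restrictions
An Oracle Text index is a domain index, and there are four index types:
- CONTEXT
- CTXCAT
- CTXRULE
- CTXXPATH
Choose the type that suits your application and the kind of text you store. All four are created with CREATE INDEX. The table below describes when each one applies.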
Index type | Description | Query operator
CONTEXT | Used to search large amounts of continuous text. Supports many document formats such as Word, HTML, XML, and plain text. Supports Chinese character sets and partitioned indexes, and is the only index type that supports parallel index creation. The index is not synchronized automatically after DML on the table; you must synchronize it manually. | CONTAINS
CTXCAT | Performs well for mixed queries. Suited to small text fragments that have some structure. The index is transactional and is synchronized automatically when the base table is updated. The CTXCAT index does not support table and index partitioning, document services (highlighting, markup, themes, and gists), or query services (explain, query feedback, and browse words). | CATSEARCH
CTXRULE | Use to build a document classification application. You create this index on a table of queries, where each query has a classification. Single documents (plain text, HTML, or XML) can be classified by using the MATCHES operator. | MATCHES
CTXXPATH | Create this index when you need to speed up ExistsNode() queries on an XMLType column. This index can be created only on an XMLType column. |
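Of these four index types, CONTEXT is the most commonly used, and it is queried with the general-purpose CONTAINS operator. This article concentrates on the CONTEXT index.
2.1.2 Privileges and Temporary Tablespace
2.2 CONTEXT Index
2.2.1 Structure of a CONTEXT Index
A CONTEXT index is an inverted index: each token maps to the locations of the text that contains it. When the index is built, the word Cat produces entries like the following:
SQL> CREATE TABLE mytext (text VARCHAR2(100));
SQL> INSERT INTO mytext
TOKEN_TEXT TOKEN_COUNT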
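To see the inverted index for yourself, you can query the index data table directly. This is a minimal sketch that assumes a CONTEXT index named mytext_idx (the name used in section 1.1) already exists in the current schema; Oracle Text keeps the token list for an index in a table named DR$<index_name>$I.

```sql
-- Assumes a CONTEXT index called mytext_idx owned by the current user.
-- TOKEN_COUNT is the number of documents recorded in that index row
-- for the token.
SELECT token_text, token_count
  FROM dr$mytext_idx$i
 ORDER BY token_text;
```

Reading this table is handy for checking how the lexer split your text; it should be treated as read-only.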
The syntax for creating the index is shown below. Once the index exists, it can be queried with CONTAINS.
CREATE INDEX [schema.]index on [schema.]table(column) INDEXTYPE IS ctxsys.context [ONLINE]
LOCAL [(PARTITION [partition] [PARAMETERS('paramstring')]
[, PARTITION [partition] [PARAMETERS('paramstring')]])]
[PARAMETERS(paramstring)] [PARALLEL n] [UNUSABLE];
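Class | Description
Datastore | Where does the data come from?
Filter | How is the data converted to text?
Lexer | What language is being indexed?
Wordlist | How should stem and fuzzy queries be expanded?
Storage | How should the index be stored?
Stop List | Which words or themes are not indexed?
Section Group | Are queries within sections allowed, and how are document sections defined? This class also converts the document to plain text.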
CREATE INDEX myindex ON docs(text) INDEXTYPE IS CTXSYS.CONTEXT;
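This statement relies on the following defaults:
1. The text is stored in the database, in a column of type CLOB, BLOB, BFILE, VARCHAR2, or CHAR.
2. The language of the text column follows the default (character set) chosen when the database was created.
3. The database default stoplist is used. A stoplist records words that occur in the text column but are not indexed.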
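2.3 Index Parameters
2.3.1 DataStore
The datastore specifies how your text is stored. By default the documents are stored inside the database in a text column (CHAR, VARCHAR, VARCHAR2, BLOB, CLOB, BFILE, or XMLType). The datastore object extracts the text from the database column before the filter processes it. The documents you index can come from several kinds of data source.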
Datastore Type | Use When
DIRECT_DATASTORE | Data is stored internally in the text column. Each row is indexed as a single document.
MULTI_COLUMN_DATASTORE | Data is stored in a text table in more than one column. Columns are concatenated to create a virtual document, one per row.
DETAIL_DATASTORE | Data is stored internally in the text column. Document consists of one or more rows stored in a text column in a detail table, with header information stored in a master table.
FILE_DATASTORE | Data is stored externally in operating system files. Filenames are stored in the text column, one per row.
NESTED_DATASTORE | Data is stored in a nested table.
URL_DATASTORE | Data is stored externally in files located on an intranet or the Internet. Uniform Resource Locators (URLs) are stored in the text column.
USER_DATASTORE | Documents are synthesized at index time by a user-defined stored procedure.
Note: Only CTXSYS is allowed to create preferences for the MULTI_COLUMN_DATASTORE type. Any other user who attempts to create a MULTI_COLUMN_DATASTORE preference receives an error, so run the following example in the ctxsys schema.
CREATE TABLE mc(id NUMBER PRIMARY KEY, NAME VARCHAR2(10), address VARCHAR2(80))
/
INSERT INTO mc
VALUES (1, 'John Smith', '123 Main Street biti');
EXEC ctx_ddl.create_preference('mymds', 'MULTI_COLUMN_DATASTORE');
EXEC ctx_ddl.set_attribute('mymds', 'columns', 'name, address');
CREATE INDEX mc_idx ON mc(NAME) INDEXTYPE IS ctxsys.CONTEXT PARAMETERS('datastore mymds')
/
SELECT *
FROM mc
WHERE contains (name, 'biti') > 0;
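Q: How do I implement full-text search over master/detail tables?
A: Use the DETAIL_DATASTORE class. It is designed for master/detail tables in which the bulk of the text is stored in a column of the detail table. Before indexing, the detail rows are concatenated into a single document, with a foreign key identifying which rows belong together. The foreign-key relationship must exist logically, but it does not have to be declared as a database constraint. Note that changing the detail table without changing the master table does not update the index. To work around this, update the value of the indexed column in the master table so that the row is queued for reindexing, or resynchronize and rebuild the index manually. The master table also needs an otherwise unused column to index, purely to keep the SQL syntax valid, such as the line_item_body column of the purchase_order table in the example below.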
------------------BEGIN---------------------
SET echo on
DROP TABLE purchase_order;
CREATE TABLE purchase_order
( id NUMBER PRIMARY KEY,
description VARCHAR2(100),
line_item_body CHAR(1)
)
/
DROP TABLE line_item;
CREATE TABLE line_item
( po_id NUMBER,
po_sequence NUMBER,
line_item_detail VARCHAR2(1000)
)
/
INSERT INTO purchase_order
(id, description)
VALUES (1, 'Many Office Items')
/
INSERT INTO line_item
(po_id, po_sequence, line_item_detail)
VALUES (1, 1, 'Paperclips to be used for many reports')
/
INSERT INTO line_item
(po_id, po_sequence, line_item_detail)
VALUES (1, 2, 'Some more Oracle letterhead')
/
INSERT INTO line_item
(po_id, po_sequence, line_item_detail)
VALUES (1, 3, 'Optical mouse')
/
COMMIT ;
BEGIN
ctx_ddl.create_preference ('po_pref', 'DETAIL_DATASTORE');
ctx_ddl.set_attribute ('po_pref', 'detail_table', 'line_item');
ctx_ddl.set_attribute ('po_pref', 'detail_key', 'po_id');
ctx_ddl.set_attribute ('po_pref', 'detail_lineno', 'po_sequence');
ctx_ddl.set_attribute ('po_pref', 'detail_text', 'line_item_detail');
END;
/
DROP INDEX po_index;
CREATE INDEX po_index ON purchase_order( line_item_body )
INDEXTYPE IS ctxsys.CONTEXT
PARAMETERS( 'datastore po_pref' )
/
SELECT id
FROM purchase_order
WHERE contains (line_item_body, 'Oracle') > 0
/
-------------------END----------------------
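The workaround described in the answer can be sketched as follows. It reuses the purchase_order, line_item, and po_index objects from the example; the new detail text 'Wireless mouse' is only an illustration, and the call to ctx_ddl.sync_index assumes a 9i database where that procedure is available.

```sql
-- Change a detail row. On its own this does not queue the
-- master document for reindexing.
UPDATE line_item
   SET line_item_detail = 'Wireless mouse'
 WHERE po_id = 1
   AND po_sequence = 3;

-- Touch the indexed column on the master row so the change is queued.
UPDATE purchase_order
   SET line_item_body = line_item_body
 WHERE id = 1;

COMMIT;

-- Apply the queued change to the index.
EXEC ctx_ddl.sync_index('po_index');
```

After the synchronization, a query on 'Wireless' should find purchase order 1, while 'Optical' should no longer match.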
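2.3.2 Filter
Once the document has been assembled, it is passed down the pipeline. The next stage is the filter. If the document is in a foreign format, the filter converts it into readable text so that it can be indexed. The default is NULL_FILTER, which simply passes the document through unchanged.
CREATE INDEX myindex
Here we use the null_filter filter class and the html_section_group section group class supplied with the ctxsys user. Section groups are introduced next.
2.3.2 Section Groups
Section groups are the key to using XML with interMedia. They process XML (or HTML) documents and emit two data streams: the section boundaries and the text content. The default is NULL_SECTION_GROUP, which simply passes the text through without any modification or processing. HTML_SECTION_GROUP is intended specifically for HTML documents.
The following example shows how to process an HTML document.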
------------------BEGIN---------------------
SET echo on
DROP TABLE my_html_docs;
CREATE TABLE my_html_docs( id NUMBER PRIMARY KEY, html_text VARCHAR2(4000))
/
INSERT INTO my_html_docs
(id,
html_text)
VALUES (1,
'<html><title>Oracle Technology</title><body>This is about the wonderful marvels of 8i and 9i</body></html>')
/
COMMIT ;
CREATE INDEX my_html_idx ON my_html_docs( html_text )INDEXTYPE IS ctxsys.CONTEXT
/
-- By default NULL_SECTION_GROUP is used, so the document stream is not processed at all.
-- Text between section boundaries can still be found.
SELECT id
FROM my_html_docs
WHERE contains (html_text, 'Oracle') > 0
/
DROP INDEX my_html_idx;
SELECT id
FROM my_html_docs
WHERE contains (html_text, 'Oracle within title') > 0;
------------------END---------------------
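A WITHIN query needs an index that was built with a section group defining the section being searched. The sketch below shows one way to do that for the my_html_docs table; the section group name my_html_sg is hypothetical, and this is not necessarily the approach the original example used.

```sql
-- my_html_sg is a hypothetical name chosen for this sketch.
BEGIN
  ctx_ddl.create_section_group('my_html_sg', 'HTML_SECTION_GROUP');
  -- Map the <title> tag to a zone section named "title".
  ctx_ddl.add_zone_section('my_html_sg', 'title', 'title');
END;
/

-- Assumes the earlier index on html_text has been dropped.
CREATE INDEX my_html_idx ON my_html_docs (html_text)
  INDEXTYPE IS ctxsys.CONTEXT
  PARAMETERS ('section group my_html_sg');

-- Matches only when Oracle occurs inside the <title> element.
SELECT id
  FROM my_html_docs
 WHERE contains (html_text, 'Oracle within title') > 0;
```

With the sample row inserted earlier, whose title is "Oracle Technology", this query should return id 1.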
CREATE TABLE employee_xml(
INSERT INTO employee_xml
BEGIN
Also, specify the null_filter, as the Inso filter is not required.
SELECT id
Thus, the following queries will return zero rows.
2.3.3 Storage Class
Type | Description
BASIC_STORAGE | Specifies the default storage parameters for a CONTEXT index

Attribute | Attribute value
i_table_clause | Parameter clause for dr$indexname$I table creation. Specify storage and tablespace clauses to add to the end of the internal CREATE TABLE statement. The I table is the index data table.
k_table_clause | Parameter clause for dr$indexname$K table creation. Specify storage and tablespace clauses to add to the end of the internal CREATE TABLE statement. The K table is the keymap table.
r_table_clause | Parameter clause for dr$indexname$R table creation. Specify storage and tablespace clauses to add to the end of the internal CREATE TABLE statement. The R table is the rowid table. The default clause is: 'LOB(DATA) STORE AS (CACHE)'
n_table_clause | Parameter clause for dr$indexname$N table creation. Specify storage and tablespace clauses to add to the end of the internal CREATE TABLE statement. The N table is the negative list table.
i_index_clause | Parameter clause for dr$indexname$X index creation. Specify storage and tablespace clauses to add to the end of the internal CREATE INDEX statement. The default clause is 'COMPRESS 2', which instructs Oracle to compress this index table. If you choose to override the default, Oracle recommends including COMPRESS 2 in your parameter clause to compress this table, since such compression saves disk space and helps query performance.
p_table_clause | Parameter clause for the substring index if you have enabled SUBSTRING_INDEX in the BASIC_WORDLIST. Specify storage and tablespace clauses to add to the end of the internal CREATE INDEX statement. The P table is an index-organized table, so the storage clause you specify must be appropriate to this type of table.
CREATE INDEX iowner.idx ON towner.tab(b) INDEXTYPE IS ctxsys.CONTEXT;
EXECUTE ctx_ddl.drop_preference('mystore');
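The attributes in the table are set on a BASIC_STORAGE preference, which is then named in the PARAMETERS clause of CREATE INDEX. The following is a sketch only: the tablespace name users and the storage sizes are assumptions, and it assumes that docs(text) has no Text index yet.

```sql
BEGIN
  ctx_ddl.create_preference('mystore', 'BASIC_STORAGE');
  -- Tablespace "users" and the INITIAL sizes are illustrative values.
  ctx_ddl.set_attribute('mystore', 'I_TABLE_CLAUSE',
                        'tablespace users storage (initial 1M)');
  ctx_ddl.set_attribute('mystore', 'K_TABLE_CLAUSE',
                        'tablespace users storage (initial 1M)');
  ctx_ddl.set_attribute('mystore', 'R_TABLE_CLAUSE',
                        'tablespace users storage (initial 1M) lob (data) store as (cache)');
  ctx_ddl.set_attribute('mystore', 'N_TABLE_CLAUSE',
                        'tablespace users storage (initial 1M)');
  -- Keep COMPRESS 2, as recommended for the $X index.
  ctx_ddl.set_attribute('mystore', 'I_INDEX_CLAUSE',
                        'tablespace users storage (initial 1M) compress 2');
END;
/

CREATE INDEX myindex ON docs (text)
  INDEXTYPE IS ctxsys.CONTEXT
  PARAMETERS ('storage mystore');
```

Placing the dr$ tables in a dedicated tablespace makes it easier to monitor how much space the Text index consumes.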
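Oracle's mechanism for full-text retrieval is actually quite simple. Oracle's proprietary lexer finds every meaningful unit in a document (Oracle calls these terms) and records them in a set of tables whose names begin with dr$, together with each term's position, occurrence count, hash value, and similar information. At query time Oracle looks up the requested terms in these tables, works out how often they occur, and uses an algorithm to compute a score for each document, the so-called match rate. The lexer is the heart of this mechanism and determines how efficient full-text retrieval is. Oracle supplies different lexers for different languages, and three of them are the ones we normally use: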
BEGIN
CREATE INDEX myindex ON mytable(mycolumn) indextype is ctxsys.context
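A lexer is selected by creating a lexer preference and naming it when the index is created. The sketch below uses the preference name my_lexer, which appears in later examples in this article; the choice of CHINESE_VGRAM_LEXER is an assumption made for illustration, and mytable/mycolumn are the placeholder names from the fragment above.

```sql
BEGIN
  -- CHINESE_VGRAM_LEXER is one of the lexer types shipped with Oracle Text.
  -- Substitute BASIC_LEXER for English and other whitespace-delimited text.
  ctx_ddl.create_preference('my_lexer', 'CHINESE_VGRAM_LEXER');
END;
/

CREATE INDEX myindex ON mytable (mycolumn)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('lexer my_lexer');
```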
Text data volume | Index data volume (4 table segments and 1 index segment)
6M | 80M
80M | 900M
230M | 2880M
1344M | 15232M
2.3.5 Stop Lists Class
A stop list can contain up to 4095 words of up to 64 characters each. Default lists are provided for English and for other languages.
SELECT spw_word FROM DR$STOPWORD;
EXECUTE ctx_ddl.add_stopword('stoppref','的');
SELECT *
SPL_OWNER SPL_NAME SPL_COUNT SPL_TYPE
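A custom stop list is built with CTX_DDL and can then be inspected through the CTX_STOPLISTS view. This sketch reuses the stoppref name from the fragment above; the two English stop words are arbitrary examples.

```sql
BEGIN
  ctx_ddl.create_stoplist('stoppref', 'BASIC_STOPLIST');
  -- Example stop words; add whatever should be ignored in your data.
  ctx_ddl.add_stopword('stoppref', 'the');
  ctx_ddl.add_stopword('stoppref', 'of');
END;
/

-- List the stop lists visible to the current user.
SELECT spl_owner, spl_name, spl_count, spl_type
  FROM ctx_stoplists;
```

To use the list, name it when creating the index, for example PARAMETERS ('stoplist stoppref').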
View the system default parameters:
SELECT par_name, par_value FROM ctx_parameters;
Set a system default parameter:
CTX_ADM.SET_PARAMETER(param_name IN VARCHAR2,
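As an illustration, the call below changes the default indexing memory. It is a sketch that assumes you are connected as CTXSYS; the value of 24 MB is arbitrary and must not exceed the MAX_INDEX_MEMORY parameter.

```sql
-- Run as CTXSYS. The value is given in bytes (24 MB here, an arbitrary choice).
BEGIN
  ctx_adm.set_parameter('DEFAULT_INDEX_MEMORY', '25165824');
END;
/

-- Confirm the new setting.
SELECT par_name, par_value
  FROM ctx_parameters
 WHERE UPPER(par_name) = 'DEFAULT_INDEX_MEMORY';
```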
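3 Index Maintenance
Errors raised during indexing can be viewed in CTX_USER_INDEX_ERRORS; you can also query CTXSYS.CTX_INDEX_ERRORS to see index errors for all users.
View the most recent errors:
SELECT err_timestamp, err_text
Clear the error view: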
DELETE FROM ctx_user_index_errors;
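Before clearing the view it is usually worth reading what it holds. A minimal query, using the standard columns of CTX_USER_INDEX_ERRORS, might look like this:

```sql
-- Newest errors first. ERR_TEXTKEY is the rowid of the document
-- that could not be indexed.
SELECT err_timestamp, err_index_name, err_textkey, err_text
  FROM ctx_user_index_errors
 ORDER BY err_timestamp DESC;
```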
DROP INDEX newsindex;
DROP INDEX newsindex FORCE;
ALTER INDEX newsindex REBUILD PARAMETERS('replace lexer my_lexer');
BEGIN
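Query the CTX_USER_PENDING view to see the corresponding pending changes. For example:
SELECT pnd_index_name, pnd_rowid,
The output of this statement looks similar to the following: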
PND_INDEX_NAME PND_ROWID TIMESTAMP
------------------------------ ------------------ --------------------
MYINDEX AAADXnAABAAAS3SAAC 06-oct-1999 15:56:50
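For CTXCAT indexes, Oracle maintains the index automatically when DML is performed on the base table, and changes to documents are reflected in the index immediately. CTXCAT is a transactional index.
HOST ctxsrv -user ctxsys/ctxsys>&/tmp/ctx.log&
Once the CTXSRV server process is running, synchronization requests are processed in the background in near real time: new data is indexed a second or two after you commit the change.
By default, if you do not start the CTXSRV process, the index is not updated automatically; you have to tell it explicitly to update itself. You can do this with alter index <iname> rebuild parameters ('sync').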
ALTER INDEX search_idx REBUILD parameters( 'sync' )
/
Index altered.
9i adds ctx_ddl.sync_index(…), a procedure dedicated to synchronizing an index.
ctx_ddl.sync_index(
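A typical manual synchronization first looks at the pending queue and then calls the procedure. The sketch assumes an index named myindex owned by the current user; the '2M' memory setting is only an example value.

```sql
-- Rows waiting to be indexed.
SELECT pnd_index_name, pnd_rowid,
       TO_CHAR(pnd_timestamp, 'dd-mon-yyyy hh24:mi:ss') AS timestamp
  FROM ctx_user_pending;

BEGIN
  -- The second argument is the memory to use for this synchronization.
  ctx_ddl.sync_index('myindex', '2M');
END;
/
```

After the call completes, the rows for myindex should no longer appear in CTX_USER_PENDING.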
For example:
SELECT /*+ FIRST_ROWS() */ ID, SCORE(1), TEXT
DOG DOC1 DOC3 DOC5
DOG DOC1 DOC3 DOC5
DOG DOC7
DOG DOC1 DOC3 DOC5
DOG DOC7
DOG DOC9
DOG DOC11
Optimizing the index with CTX_DDL.OPTIMIZE_INDEX, using either the FULL or the FAST option, reduces index fragmentation and improves index efficiency.
BEGIN
BEGIN
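The two optimization levels can be invoked as sketched below. The index name myindex and the 10-minute limit are assumptions chosen for illustration.

```sql
BEGIN
  -- FAST compacts fragmented rows but does not remove
  -- information left behind by deleted documents.
  ctx_ddl.optimize_index('myindex', 'FAST');
END;
/

BEGIN
  -- FULL also removes information for deleted documents.
  -- maxtime limits the run, in minutes.
  ctx_ddl.optimize_index('myindex', 'FULL', maxtime => 10);
END;
/
```

Because a FULL run can be time-boxed with maxtime, it can be scheduled in short slices during quiet periods and will continue where it stopped on the next run.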
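In a SELECT statement you can specify the CONTAINS operator in the WHERE clause, and you can also return the score of each row.
The score measures how closely a result matches the query; the higher the score, the better the match. You can sort the results by SCORE.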
SELECT score (1), title
FROM news
WHERE contains (text, 'Oracle', 1) > 0
ORDER BY score (1) DESC;
SELECT score (1), title, issue_date
FROM news
WHERE contains (text, 'Oracle', 1) > 0
AND issue_date >= ('01-OCT-97')
ORDER BY score (1) DESC;
This SELECT statement returns all articles that contain the word Oracle and were written on or after October 1, 1997.
Operator: AND
Symbol: &
Description: Use the AND operator to search for documents that contain at least one occurrence of each of the query terms. Score returned is the minimum of the operands.
Example expressions: 'cats AND dogs', 'cats & dogs'

Operator: OR
Symbol: |
Description: Use the OR operator to search for documents that contain at least one occurrence of any of the query terms. Score returned is the maximum of the operands.
Example expressions: 'cats | dogs', 'cats OR dogs'

Operator: NOT
Symbol: ~
Description: Use the NOT operator to search for documents that contain one query term and not another.
Example expression: To obtain the documents that contain the term animals but not dogs, use 'animals ~ dogs'

Operator: ACCUM
Symbol: ,
Description: Use the ACCUM operator to search for documents that contain at least one occurrence of any of the query terms. The accumulate operator ranks documents according to the total term weight of a document.
Example expression: The following query returns all documents that contain the terms dogs, cats and puppies, giving the highest scores to the documents that contain all three terms: 'dogs, cats, puppies'

Operator: EQUIV
Symbol: =
Description: Use the EQUIV operator to specify an acceptable substitution for a word in a query.
Example expression: The following example returns all documents that contain either the phrase alsatians are big dogs or German shepherds are big dogs: 'German shepherds=alsatians are big dogs'
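Used inside CONTAINS, the operators look like this. The queries are a sketch against the news table from the earlier examples and assume it has a CONTEXT index on its text column.

```sql
-- Documents containing both terms.
SELECT title FROM news WHERE contains (text, 'cats AND dogs') > 0;

-- Documents containing either term.
SELECT title FROM news WHERE contains (text, 'cats OR dogs') > 0;

-- Documents containing animals but not dogs.
SELECT title FROM news WHERE contains (text, 'animals NOT dogs') > 0;

-- ACCUM: documents matching more of the terms score higher.
SELECT score (1), title
  FROM news
 WHERE contains (text, 'dogs ACCUM cats ACCUM puppies', 1) > 0
 ORDER BY score (1) DESC;
```

The word forms (AND, OR, NOT, ACCUM) and the symbol forms are interchangeable; the word forms tend to be easier to read in application code.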
See the Oracle Text Reference, Release 9.2, for more operators and options.
5.3 How to Optimize Queries
- estimate the selectivity of the CONTAINS predicate
- estimate the I/O and CPU costs of using the Text index, that is, the cost of processing the CONTAINS predicate using the domain index
- estimate the I/O and CPU costs of each invocation of CONTAINS
Note:Use the FIRST_ROWS(n) hint when you need only the first few hits of a query. When you need the entire result set, do not use this hint as it might result in poor performance.
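The optimizer can only make these estimates when statistics exist for the table. A sketch of the two steps, gathering statistics and then asking for the first few hits, is shown below; the news table comes from the earlier examples and the value 10 is an arbitrary choice.

```sql
-- Gather statistics so the optimizer can cost the CONTAINS predicate.
ANALYZE TABLE news COMPUTE STATISTICS;

-- Optimize for returning the first 10 hits quickly.
SELECT /*+ FIRST_ROWS(10) */ score (1), title
  FROM news
 WHERE contains (text, 'Oracle', 1) > 0
 ORDER BY score (1) DESC;
```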
6.1 Creating a Local Partitioned Index:
------------------------BEGIN---------------------------
PROMPT create partitioned table and populate it
CREATE TABLE part_tab (a int, b varchar2(40)) PARTITION BY RANGE(a)
(partition p_tab1 values less than (10),
partition p_tab2 values less than (20),
partition p_tab3 values less than (30));
PROMPT create customer storage preference assigned each partition
PROMPT create partitioned index
CREATE INDEX part_idx on part_tab(b) INDEXTYPE IS CTXSYS.CONTEXT
LOCAL (partition p_idx1 parameters('storage mystore1'), partition p_idx2 parameters('storage mystore2'), partition p_idx3 parameters('storage mystore3'));
-------------------------END----------------------------
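The CREATE INDEX statement in the example refers to three storage preferences, mystore1 to mystore3, which must exist before the index is created. One way to define them is sketched here; the tablespace names ts_text1 to ts_text3 are hypothetical.

```sql
BEGIN
  -- One preference per partition, each pointing at its own tablespace.
  -- The tablespace names are placeholders for this sketch.
  ctx_ddl.create_preference('mystore1', 'BASIC_STORAGE');
  ctx_ddl.set_attribute('mystore1', 'I_TABLE_CLAUSE', 'tablespace ts_text1');

  ctx_ddl.create_preference('mystore2', 'BASIC_STORAGE');
  ctx_ddl.set_attribute('mystore2', 'I_TABLE_CLAUSE', 'tablespace ts_text2');

  ctx_ddl.create_preference('mystore3', 'BASIC_STORAGE');
  ctx_ddl.set_attribute('mystore3', 'I_TABLE_CLAUSE', 'tablespace ts_text3');
END;
/
```

Spreading the partitions over separate tablespaces lets each index partition be managed and sized independently.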
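6.2 Creating a Local Partitioned Index in Parallel
A partitioned index can be built in parallel to speed up index creation, but it cannot be done in a single step. You must first create the index as unusable, and then use DBMS_PCLXUTIL.BUILD_PART_INDEX to build it in parallel.
------------------------BEGIN---------------------------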
PROMPT the base table has three partitions.
PROMPT We create a local partitioned unusable index first
CREATE INDEX tdrbip02bx ON tdrbip02b(text)
indextype is ctxsys.context local (partition tdrbip02bx1,
partition tdrbip02bx2,
partition tdrbip02bx3)
unusable;
PROMPT run DBMS_PCLXUTIL.BUILD_PART_INDEX, which builds the 3 partitions in parallel (inter-partition parallelism). Also inside each partition, index creation is done in parallel (intra-partition parallelism) with a parallel degree of 2.
BEGIN
DBMS_PCLXUTIL.build_part_index (3, 2, 'TDRBIP02B', 'TDRBIP02BX', TRUE);
END;
/
-------------------------END----------------------------
You can run a full-text query against a single table partition.
SELECT *
FROM part_tab PARTITION (p_tab3)
WHERE contains (b, 'Oracle', 1) > 0
ORDER BY score (1);
First, create the datastore preference. Set the DATASTORE type to FILE_DATASTORE, which tells Oracle to index text from files under a file path.
BEGIN
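A minimal version of this step is sketched below. The preference name my_datastore_prefs matches the one used in the CREATE INDEX statement later in this section; the directory c:/TEMP is an assumed location on the database server.

```sql
BEGIN
  ctx_ddl.create_preference('my_datastore_prefs', 'FILE_DATASTORE');
  -- Directory on the database server that holds the documents
  -- (assumed path; change it to match your environment).
  ctx_ddl.set_attribute('my_datastore_prefs', 'path', 'c:/TEMP');
END;
/
```

The path is resolved on the server, so the operating system account that runs Oracle needs read access to the files.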
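Next, create the table that stores the file names. The id column is the primary key, the title column is a short description of the document, and the thefile column holds the name of a file in the Path directory on disk. The file must be found under the Path directory; otherwise Oracle reports that the file cannot be accessed.
Then insert data into the table. Note that the thefile column must refer to files under the specified Path on the server.
CREATE TABLE mydocs( id NUMBER PRIMARY KEY, title VARCHAR2(255), thefile VARCHAR2(255) );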
INSERT INTO mydocs( id, title, thefile ) VALUES( 1, 'Document1', 'WordDoc1.doc');
INSERT INTO mydocs( id, title, thefile ) VALUES( 2, 'Document2', 'WordDoc2.doc');
INSERT INTO mydocs( id, title, thefile ) VALUES( 3, 'Document3', 'WordDoc3.doc');
CREATE INDEX mydocs_text_index ON mydocs(thefile) INDEXTYPE IS ctxsys.context
PARAMETERS('datastore my_datastore_prefs Filter ctxsys.inso_filter Lexer my_lexer');
--
-- Test whether the files were indexed successfully
--
SELECT id,title
FROM mydocs
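For example: CTX_DDL.SET_ATTRIBUTE( 'my_datastore_prefs', 'path', 'c:/TEMP;c:/docs' );
If both directories contain a file with the same name, 1.doc, and the thefile column stores only the bare file name 1.doc, Oracle searches the directories in order and therefore indexes the copy in C:/TEMP twice. You can avoid this by including the path in the stored file name.
Keeping the index synchronized when documents change:
If you change the contents of a file under the path, adding or removing text, Oracle does not notice the change when it synchronizes the index. One way to make sure the index is synchronized is as follows:
After changing the file's contents, update the thefile column of the table while keeping the file path the same.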
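UPDATE mydocs
SET thefile = 'c:/source.doc'
WHERE thefile = 'c:/source.doc';
Only then does the next index synchronization bring the index in line with the document contents.
Error messages raised while creating or synchronizing the index can be viewed in the ctx_user_index_errors user view.
The URL_DATASTORE type supports HTTP access, FTP access, and access to the local file system. URLs stored in the text column have the following format: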
[URL:]<access_scheme>://<host_name>[:<port_number>]/[<url_path>]
http://mymachine.us.Oracle.com/home.html
Note: the login:password@ syntax is valid only for FTP access.
Some of the URL_DATASTORE attributes are listed below; timeout and the proxy attributes are the ones used most often:
Attribute | Attribute value
timeout | Specify the timeout in seconds. The valid range is 15 to 3600 seconds. The default is 30. Adjust this value to suit network performance.
maxthreads | Specify the maximum number of threads that can be running simultaneously. Use a number between 1 and 1024. The default is 8.
urlsize | Specify the maximum length of URL string in bytes. Use a number between 32 and 65535. The default is 256.
maxurls | Specify maximum size of URL buffer. Use a number between 32 and 65535. The default is 256.
maxdocsize | Specify the maximum document size. Use a number between 256 and 2,147,483,647 bytes (2 gigabytes). The default is 2,000,000.
http_proxy | Specify the host name of http proxy server. Optionally specify port number with a colon in the form hostname:port.
ftp_proxy | Specify the host name of ftp proxy server. Optionally specify port number with a colon in the form hostname:port.
no_proxy | Specify the domain for no proxy server. Use a comma separated string of up to 16 domain names.
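Index creation procedure:
First create your own URL_DATASTORE preference. The example below sets the timeout; a proxy is set in the same way through the http_proxy attribute.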
BEGIN
ctx_ddl.create_preference ('URL_PREF', 'URL_DATASTORE');
ctx_ddl.set_attribute ('URL_PREF', 'Timeout', '300');
END;
Create the table that stores the URLs:
CREATE TABLE urls(id NUMBER PRIMARY KEY, url VARCHAR2(2000));
INSERT INTO urls
VALUES (1, 'http://intranet-center/');
INSERT INTO urls
VALUES (2, 'http://founderweb:9080/default.jsp');
COMMIT ;
Create the index. HTML_SECTION_GROUP can be used when indexing HTML files:
CREATE INDEX datastores_text ON urls ( url )
INDEXTYPE IS ctxsys.CONTEXT PARAMETERS (
'Datastore URL_PREF Lexer my_lexer Section group ctxsys.HTML_SECTION_GROUP' );
SELECT token_text
FROM dr$datastores_text$i;
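9. Common Errors
The following explains some common error messages and how to resolve them:
Then synchronize the index or force-drop it:
Solution: chinese_lexer supports only the UTF8 character set. You now face a choice: put up with the shortcomings of the Chinese VGRAM lexer, or convert the database character set to UTF8 and accept the risk that your application may no longer handle Chinese correctly (consult Oracle Support first, and contact your application vendor).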
ORA-29856: err when execute ODCIINDEXDROP
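Solution: An interMedia object was not created or compiled correctly. Log in as the ctxsys user and run:
and
$ORACLE_HOME/ctx/admin/dr0plb.sql to re-create all the packages. You can also look at the dba_objects view directly, find the objects owned by ctxsys whose status is INVALID, and re-create or recompile them. (You may find many such objects; do not be alarmed, Oracle will not crash because of them.)
Solution: This problem can have several causes. First, try reducing the sort_area_size parameter to no more than 2M, which stops Oracle from allocating too much sort memory during index creation and exhausting resources. If that does not help and you are on 8.1.7, you have hit bug 1391737. The bug causes Oracle to exhaust memory when any row in the column being indexed is longer than 2000 characters. The only fix is to apply the 8.1.7.1B patch.
10.1 Oracle Text Resources: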
- 4x400Mhz Sun Sparc CPUs
- 4 gig of RAM
- EMC symmetrix (24 disks striped)
- Parallel degree of 5 with 5 partitions
- Index memory of 600MB per index process
- XML news documents that averaged 5K in size
- USER_DATASTORE
See Also:
Oracle Text Reference to learn more about Oracle Text system parameters.
Oracle9i Database Administrator's Guide for more information on setting SGA related parameters.
Oracle9i Database Performance Guide and Reference for more information on memory allocation and setting the SORT_AREA_SIZE parameter.
Note:
It is no longer necessary to create a partitioned table to index in parallel as was the case in earlier releases.
Note:
When you create a local index in parallel as such (which is actually run in serial), subsequent queries are processed in parallel by default. Creating a non-partitioned index in parallel does not turn on parallel query processing.
Parallel querying degrades query throughput especially on heavily loaded systems. Because of this, Oracle recommends that you disable parallel querying after indexing. To do so, use ALTER INDEX NOPARALLEL.
How do I create a local partitioned index in parallel?
See Also:
Oracle Text Reference to learn more about using this procedure.
See Also:
Oracle Text Reference to learn more about using CTX_DDL.SYNC_INDEX.
See Also:
Oracle Text Reference to learn more about using the CTX_REPORT package.
<%@ page import="java.sql.* , oracle.jsp.dbutil.*" %>
<jsp:useBean id="name" class="oracle.jsp.jml.JmlString" scope="request" >
<jsp:setProperty name="name" property="value" param="query" />
</jsp:useBean>
<%
String connStr="jdbc:oracle:thin:@localhost:1521:betadev";
java.util.Properties info = new java.util.Properties();
Connection conn = null;
ResultSet rset = null;
Statement stmt = null;
if (name.isEmpty()) { %>
<html>
<title>search1 Search</title>
<body>
<center>
<form method=post>
Search for:
<input type=text name=query size=30>
<input type=submit value="Search">
</form>
</center>
<hr>
</body>
</html>
<%
}
else {
%>
<html>
<title>Search</title>
<body>
<center>
<form method=post action="search_html.jsp">
Search for:
<input type=text name="query" value=<%= name.getValue() %> size=30>
<input type=submit value="Search">
</form>
</center>
<%
try {
DriverManager.registerDriver(new oracle.jdbc.driver.OracleDriver() );
info.put ("user", "ctxdemo");
info.put ("password","ctxdemo");
conn = DriverManager.getConnection(connStr,info);
stmt = conn.createStatement();
String theQuery = request.getParameter("query");
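// NOTE: theQuery is concatenated into the SQL text unescaped, as in the original demo.
String myQuery = "select /*+ FIRST_ROWS */ rowid, tk, title, score(1) "
+ "scr from search_table where contains(text, '" + theQuery + "',1 ) > 0 order by "
+ "score(1) desc";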
rset = stmt.executeQuery(myQuery);
String color = "ffffff";
int myTk = 0;
String myTitle = null;
int myScore = 0;
int items = 0;
while (rset.next()) {
myTk = (int)rset.getInt(2);
myTitle = (String)rset.getString(3);
myScore = (int)rset.getInt(4);
items++;
if (items == 1) {
%>
<center>
<table border="0">
<tr bgcolor="#6699CC">
<th>Score</th>
<th>Title</th>
</tr>
<% } %>
<tr bgcolor="#<%= color %>">
<td> <%= myScore %>%</td>
<td> <%= myTitle %>
</td>
</tr>
<%
if (color.compareTo("ffffff") == 0)
color = "eeeeee";
else
color = "ffffff";
}
} catch (SQLException e) {
%>
<b>Error: </b> <%= e %><p>
<%
} finally {
// Close in reverse order of creation: result set, statement, connection.
if (rset != null) rset.close();
if (stmt != null) stmt.close();
if (conn != null) conn.close();
}
%>
</table>
</center>
</body></html>
<%