How to use Oracle full-text index better

There are several ways to search for text in an Oracle database without using Oracle Text, for example with the standard INSTR function or the LIKE operator.

SELECT * FROM mytext WHERE INSTR(thetext, 'Oracle') > 0;

SELECT * FROM mytext WHERE thetext LIKE '%Oracle%';

INSTR and LIKE work well in many cases, especially when the search covers only a small table. However, both force a full table scan, consume considerable resources, and offer very limited search capabilities. For large volumes of text, it is therefore better to build a full-text index with the full-text search facility that Oracle provides.

Step 1: Check and set up the database user and role.

First, check whether the CTXSYS user and the CTXAPP role exist in the database. If they do not, the interMedia feature was not installed when the database was created, and you must modify the database to install it. The CTXSYS user is locked by default, so unlock it first.

Step 2: Grant the execute privilege on the ctx_ddl package owned by CTXSYS to each user who will use full-text indexing, for example:

grant execute on ctx_ddl to pomoho;
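The checks described in step 1 can be sketched as the queries below. This is only a minimal sketch, assuming a connection as a DBA user who can read the DBA_USERS and DBA_ROLES dictionary views and who holds the ALTER USER privilege.

```sql
-- Does the CTXSYS user exist, and is the account locked?
SELECT username, account_status
  FROM dba_users
 WHERE username = 'CTXSYS';

-- Does the CTXAPP role exist?
SELECT role
  FROM dba_roles
 WHERE role = 'CTXAPP';

-- If CTXSYS exists but is locked, unlock it.
ALTER USER ctxsys ACCOUNT UNLOCK;
```

If either query returns no rows, the full-text feature is probably not installed in this database.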

Step 3: Set up Lexical Analyzer (lexer)

The mechanism behind Oracle full-text retrieval is actually quite simple. Oracle's proprietary lexer finds all the meaningful units in a document (Oracle calls them terms) and records them in a set of tables whose names begin with dr$, together with information such as each term's position, frequency and hash value. At query time, Oracle looks up the corresponding terms in these tables, calculates how often they occur, and computes a score for each document according to an algorithm; this score is the so-called "matching rate". The lexer is the core of this mechanism and determines the efficiency of full-text retrieval. Oracle provides different lexers for different languages; three of them are commonly used:

- basic_lexer: for English. It splits sentences into words at spaces and punctuation marks, and it automatically treats words that occur too frequently to be useful for retrieval, such as "if" and "is", as noise (stop words). Its processing efficiency is high. For Chinese, however, this lexer causes problems. Because it only recognizes spaces and punctuation marks, and Chinese sentences usually contain no spaces, it treats a whole sentence as a single term, so the text in effect cannot be searched. Take the sentence "中国人民站起来了" ("The Chinese people have stood up") as an example. basic_lexer produces only one term, the whole sentence. A search for "中国" ("China") then retrieves nothing.

- chinese_vgram_lexer: a dedicated Chinese lexer that supports all the Chinese character sets (ZHS16CGB231280, ZHS16GBK, ZHT32EUC, ZHT16BIG5, ZHT32TRIS, ZHT16MSWIN950 and others). It analyzes Chinese sentences character by character. The sentence "中国人民站起来了" ("The Chinese people have stood up") is split into overlapping units such as "中国", "国人", "人民", "民站", "站起", "起来" and "来了". The algorithm is very simple and catches every possible term, but its efficiency is unsatisfactory.

- chinese_lexer: a newer Chinese lexer that supports only the UTF8 character set. As shown above, chinese_vgram_lexer is very mechanical because it has no knowledge of common Chinese words. Units such as "民站" and "站起" from the example sentence "中国人民站起来了" never occur on their own in Chinese, so such terms are meaningless and only hurt efficiency. The biggest improvement in chinese_lexer is that it recognizes most common Chinese words, so it analyzes sentences more efficiently and no longer produces meaningless units like those two. However, it supports only UTF8. If your database uses the ZHS16GBK character set, you have to fall back on the cruder chinese_vgram_lexer.

If nothing is set, Oracle uses basic_lexer by default. To specify which lexer to use, proceed as follows:

First, create a preference under the current user (for example, execute the following statement under the pomoho user).

exec ctx_ddl.create_preference('my_lexer', 'chinese_vgram_lexer');

Secondly, when building a full-text index, specify the lexical analyzer used:

CREATE INDEX myindex ON mytable(mycolumn) INDEXTYPE IS ctxsys.context
PARAMETERS ('lexer my_lexer');

The full-text index thus established will use chinese_vgram_lexer as the parser.
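Which Chinese lexer to pick depends on the database character set, as described above. The block below is a minimal sketch of that decision, assuming the preference name my_lexer from the example and following this article's statement that chinese_lexer works only with UTF8; the supported character sets may differ between Oracle releases, so treat the comparison as an assumption to verify for your version.

```sql
DECLARE
  v_charset VARCHAR2(64);
BEGIN
  -- Read the database character set from the data dictionary.
  SELECT value
    INTO v_charset
    FROM nls_database_parameters
   WHERE parameter = 'NLS_CHARACTERSET';

  -- Assumption taken from the article: chinese_lexer requires UTF8.
  IF v_charset = 'UTF8' THEN
    ctx_ddl.create_preference('my_lexer', 'CHINESE_LEXER');
  ELSE
    ctx_ddl.create_preference('my_lexer', 'CHINESE_VGRAM_LEXER');
  END IF;
END;
/
```

If a preference named my_lexer already exists, create_preference raises an error; drop it first with ctx_ddl.drop_preference('my_lexer').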

Step 4: Create an index.

Use the following syntax to create a full-text index

CREATE INDEX [schema.]index ON [schema.]table(column) INDEXTYPE IS ctxsys.context [ONLINE]
LOCAL [(PARTITION [partition] [PARAMETERS('paramstring')]
[, PARTITION [partition] [PARAMETERS('paramstring')]])]
[PARAMETERS(paramstring)] [PARALLEL n] [UNUSABLE];

Example:

CREATE INDEX ctx_idx_menuname ON pubmenu(menuname)
INDEXTYPE IS ctxsys.context PARAMETERS ('lexer my_lexer');

Use index

Using full-text indexing is simple and can be achieved in the following ways:

SELECT * FROM pubmenu WHERE CONTAINS(menuname, 'upload picture') > 0;
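Because Oracle computes a score for every matching document, the query can also return that score and sort by it. The following is a minimal sketch, assuming the pubmenu table and the menuname index from the example above; the label 1 is an arbitrary number that ties SCORE to the matching CONTAINS call.

```sql
-- Return matching rows together with their relevance score, best match first.
SELECT SCORE(1) AS relevance, menuname
  FROM pubmenu
 WHERE CONTAINS(menuname, 'upload picture', 1) > 0
 ORDER BY SCORE(1) DESC;
```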

Types of full-text indexes

An Oracle Text index is a domain index. There are four index types:

1 CONTEXT
2 CTXCAT
3 CTXRULE
4 CTXXPATH

Choose among them according to your application and the type of text data.
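For comparison with the CONTEXT examples in this article, here is a minimal sketch of a CTXCAT index, which is queried with CATSEARCH instead of CONTAINS. The table auction_items, its columns and the index name are hypothetical and exist only for this illustration.

```sql
-- Hypothetical table used only for this illustration.
CREATE TABLE auction_items (
  item_id NUMBER PRIMARY KEY,
  title   VARCHAR2(200)
);

-- A CTXCAT index is created with its own index type.
CREATE INDEX auction_title_idx ON auction_items(title)
  INDEXTYPE IS ctxsys.ctxcat;

-- CTXCAT indexes are queried with CATSEARCH; the third argument
-- is an optional structured condition, NULL here.
SELECT item_id, title
  FROM auction_items
 WHERE CATSEARCH(title, 'camera', NULL) > 0;
```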

Create a full-text index for multiple fields

It is often necessary to find matching records across several text columns. In that case, build one full-text index that covers multiple columns. For example, to run full-text searches over the subjectname and briefintro columns of the pmhsubjects table, perform the following steps:

Create a multi-column datastore preference:

EXEC ctx_ddl.create_preference('ctx_idx_subject_pref', 'MULTI_COLUMN_DATASTORE');

Set the columns that the preference covers (logged in as ctxsys):

EXEC ctx_ddl.set_attribute('ctx_idx_subject_pref', 'columns', 'subjectname,briefintro');

Establish full-text index

CREATE INDEX ctx_idx_subject ON pmhsubjects(subjectname)
INDEXTYPE IS ctxsys.context PARAMETERS ('DATASTORE ctxsys.ctx_idx_subject_pref LEXER my_lexer');

Use index

SELECT * FROM pmhsubjects WHERE CONTAINS(subjectname, 'Chris Lee') > 0;

Maintenance of full-text index

For a CTXSYS.CONTEXT index, the index has to be maintained after the application performs DML on the base table. Index maintenance consists of index synchronization and index optimization.

After the index has been created, you will find that Oracle has automatically generated the following tables under the same user (assuming the index is named myindex):

DR$myindex$I, DR$myindex$K, DR$myindex$R and DR$myindex$N. Of these, the I table is the most important. You can query it to see what it contains:

SELECT token_text, token_count FROM dr$i_rsk1$i WHERE rownum <= 20;

The query results are omitted here. As you can see, this table holds the term records that Oracle generates when it analyzes your documents, including each term's position, frequency and hash value. When the content of a document changes, the content of the I table must change accordingly so that Oracle can still retrieve the right content, because full-text retrieval is at its core a query against this table. This is where sync and optimize come in.

Sync: saves new terms to the I table.

Optimize: cleans up garbage in the I table, mainly by removing terms that belong to deleted documents.

When an indexed document in the base table is inserted, updated or deleted, the change does not affect the index until the index is synchronized. You can query the view CTX_USER_PENDING to see the pending changes. For example:

SELECT pnd_index_name, pnd_rowid,
TO_CHAR(pnd_timestamp, 'dd-mon-yyyy hh24:mi:ss') timestamp
FROM ctx_user_pending;

The output of this statement is similar to the following:

PND_INDEX_NAME                 PND_ROWID          TIMESTAMP
------------------------------ ------------------ --------------------
MYINDEX                        AAADXnAABAAAS3SAAC 06-oct-1999 15:56:50

Synchronization and optimization: you can use the ctx_ddl package provided by Oracle to synchronize and optimize indexes.

For an index of type CTXCAT, Oracle maintains the index automatically when DML is performed on the base table, and changes to documents are reflected in the index immediately. CTXCAT is a transactional index.

Synchronization of indexes

Synchronize the index after rows in the base table have been inserted, modified or deleted. It is recommended to use sync_index for this. Syntax:

ctx_ddl.sync_index(
idx_name IN VARCHAR2 DEFAULT NULL,
memory IN VARCHAR2 DEFAULT NULL,
part_name IN VARCHAR2 DEFAULT NULL,
parallel_degree IN NUMBER DEFAULT 1);

idx_name: the name of the index.

memory: the memory to use for synchronization. The default is the system parameter DEFAULT_INDEX_MEMORY. A larger value speeds up indexing and queries and produces less index fragmentation.

part_name: the index partition to synchronize.

parallel_degree: synchronizes the index in parallel; sets the degree of parallelism.

For example, to synchronize the index myindex:

exec ctx_ddl.sync_index('myindex');
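The optional parameters described above can be passed by name. The following is a minimal sketch, assuming the index myindex from the example; the memory size of 50M and the parallel degree of 2 are illustrative values, not recommendations.

```sql
BEGIN
  -- Synchronize myindex with more memory and two parallel processes.
  ctx_ddl.sync_index(
    idx_name        => 'myindex',
    memory          => '50M',
    parallel_degree => 2
  );
END;
/
```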

Implementation suggestion: synchronize the index from a scheduled Oracle job.

Index optimization

Frequent index synchronization fragments a CONTEXT index, and index fragmentation seriously slows down queries. Optimizing the index regularly reduces fragmentation, shrinks the index and improves query efficiency.

When you delete text from a table, Oracle Text marks the document as deleted but does not modify the index immediately. The obsolete document information therefore takes up unnecessary space and adds to the cost of queries. To remove this invalid old information from the index, you must optimize the index in FULL mode. This process is called garbage collection, and it is necessary when the text data in the table is updated and deleted frequently.

exec ctx_ddl.optimize_index('myidx', 'full');

Suggestion: Optimize the full-text index every day when the system is idle to improve the retrieval efficiency.

P.S. Synchronizing and optimizing the index on a schedule

Create scheduled jobs that synchronize and optimize the domain index at regular intervals.

SQL> create or replace procedure hsp_sync_index as
  2  begin
  3  ctx_ddl.sync_index('id_cont_msg');
  4  end;
  5  /

Procedure created.

Elapsed: 00:00:00.08

SQL> variable jobno number;

SQL> begin
  2  DBMS_JOB.SUBMIT(:jobno, 'hsp_sync_index();',
  3  SYSDATE, 'SYSDATE + (1/24/4)');
  4  commit;
  5  end;
  6  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.27

SQL> create or replace procedure hsp_optimize_index as
  2  begin
  3  ctx_ddl.optimize_index('id_cont_msg', 'FULL');
  4  end;
  5  /

Procedure created.

Elapsed: 00:00:00.03

SQL> variable jobno number;

SQL> begin
  2  DBMS_JOB.SUBMIT(:jobno, 'hsp_optimize_index();',
  3  SYSDATE, 'SYSDATE + 1');
  4  commit;
  5  end;
  6  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.02
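To confirm that both jobs were registered, you can query the USER_JOBS dictionary view. This is a minimal sketch, assuming it runs as the same user who submitted the jobs with DBMS_JOB.

```sql
-- List the submitted jobs with their next run time and repeat interval.
SELECT job, what, next_date, interval, broken, failures
  FROM user_jobs
 ORDER BY job;
```

The WHAT column should show the hsp_sync_index and hsp_optimize_index calls, and INTERVAL should show the two expressions passed to DBMS_JOB.SUBMIT.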