Computer information retrieval is, in essence, a process in which the computer compares and matches the submitted retrieval strategy against the document feature identifiers, and their logical combinations, stored in the system. Because information needs are themselves uncertain, searchers rarely understand the characteristics of the documents in a database in full, and system functions have their limits, retrieval results suffer to varying degrees. However, by following a definite sequence of search steps and formulating a sound retrieval strategy, we can reduce the influence of these unfavorable factors and keep the search expression as consistent as possible with both the information need and the retrieval system, so that the system returns information that meets the user's needs.
1. Search steps
(1) Define the information need and the purpose of the search.
An information need is an objective or subjective demand for information of any kind. It is the starting point from which people seek information, and in online retrieval it is also the basis for selecting a database, determining the retrieval strategy, and evaluating the retrieval results. Different types of topics carry different information needs. For example, novelty searches, such as those supporting an invention application, an application for an achievement award, or a project appraisal, often require a comprehensive collection of the literature within a given discipline. Such topics are broad and retrospective in character, so the search should emphasize recall. By contrast, for a key research topic aimed at solving a specific problem in scientific research or production, it is usually enough that the retrieved information helps one's own research, and the range of documents searched need not be very wide. Such topics call for a search that emphasizes precision.
How should an information need be analyzed correctly? It helps to consider its form and its content separately. The questions to settle about form are:
① Clarify the purpose of the search: is it to support a declaration of results, or to follow the latest progress on the subject? The answer determines whether the retrieval strategy should favor recall or precision.
② Determine the number of documents required. An upper limit on the number of documents wanted is an important parameter for shaping the later retrieval strategy and controlling the search cost. The number of relevant documents likely to exist on the subject should be estimated at the same time.
③ Determine the language, date range, document type, author, or other external characteristics of the required documents, which helps to limit the search scope.
The main questions to settle about content are:
① Identify the main subject areas that the search topic involves, which is essential for selecting a suitable database later.
② Analyze the main content of the search topic and express the content requirements in natural language. This is an important link in online retrieval.
(2) Select the database and determine the retrieval approach.
After the information need has been analyzed, an appropriate database can be chosen according to the known conditions, and this choice implies the choice of retrieval system as well. For example, foreign patent literature can be searched in the domestic GWZL database of the BDSIRS system. However, its retrieval approaches and its coverage of the latest patent documents are not as good as those of the WPI database of the DIALOG system in the United States, so when retrieval requirements are high, the DIALOG system is still often used.
When choosing a database, you should first find out:
① The subject areas of the information collected in the database;
② The types of documents included, and preferably the main sources of those documents;
③ The time range covered;
④ The basic index and additional indexes of the database, and the characteristics of the retrieval approaches and search identifiers they provide;
⑤ The search cost of the database, including the machine-time cost and the printing cost per record.
Once the database is selected, the retrieval approaches it offers are fixed as well, and one or more of them can be chosen according to the known conditions. Because computers have large storage capacity and operate quickly, indexes can be built on many fields. Documents can therefore be retrieved not only through the usual subject terms, classification numbers, and authors, but also through free words, document types, and journal titles, and these approaches can be combined for cross-field searching, which is beyond the reach of manual retrieval.
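The multi-field access described above can be sketched as a small inverted index keyed by (field, value) pairs. This is only an illustrative sketch in Python: the field names (`subject`, `author`, `doc_type`), the sample records, and the helper names `build_index` and `search` are assumptions made for the example, not features of any particular retrieval system.

```python
from collections import defaultdict

# Hypothetical sample records; the fields and values are illustrative only.
RECORDS = {
    1: {"subject": ["information retrieval"], "author": ["Li"], "doc_type": ["journal"]},
    2: {"subject": ["information retrieval", "patents"], "author": ["Wang"], "doc_type": ["patent"]},
    3: {"subject": ["patents"], "author": ["Li"], "doc_type": ["patent"]},
}


def build_index(records):
    """Map each (field, value) pair to the set of record ids that contain it."""
    index = defaultdict(set)
    for rec_id, fields in records.items():
        for field, values in fields.items():
            for value in values:
                index[(field, value.lower())].add(rec_id)
    return index


def search(index, **criteria):
    """Cross-field search: a record must match every field=value criterion."""
    result = None
    for field, value in criteria.items():
        hits = index.get((field, value.lower()), set())
        result = hits if result is None else result & hits
    # No criteria at all is treated as an empty result in this sketch.
    return set() if result is None else set(result)


INDEX = build_index(RECORDS)
print(search(INDEX, subject="patents"))               # search on one field
print(search(INDEX, subject="patents", author="Li"))  # two fields combined
```

Each additional field criterion can only keep or shrink the result set, which is why combining access points is a natural way to limit the search scope.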
(3) Determine the concept groups and search identifiers of the topic.
Once the information need and the main content of the search topic are understood, determining the concept groups and search identifiers is an important step. When the topic has complex subject content, the concept groups that make up that content should be clearly defined and then joined by logical combination into a compound concept or concept relationship that expresses the user's information need.
Once the concept groups of the topic are determined, they must be converted into search identifiers that the system recognizes. Search identifiers should meet two requirements: first, relevance, meaning that the identifiers reflect the information need; second, matching, meaning that the identifiers are consistent with the feature identifiers under which the retrieval system stores documents.
Search identifiers generally take the following three forms:
① Standardized words: choose standardized words or phrases from the thesaurus of the database to be searched, because the thesaurus is the retrieval language that both indexing and searching of the database must follow. To make the identifiers of the search question consistent with the identifiers of the document features and obtain the best retrieval results, standardized words should be the first choice.
② Standardized codes: an index code is an indexing unit that the database system assigns to a particular subject category or subject concept. Such codes are highly specific and serve as document feature identifiers with good retrieval results. Examples are the international patent classification number IC =, the product code PC = of the PTS database, and the standard industrial code SC =.
③ Free words: free-word searching can make full use of the system's full-text retrieval function. Choosing standardized words or codes requires a thesaurus or classification table to convert natural language into the standardized language, and differences in thinking between the indexer and the searcher also affect the retrieval results. In such cases, searching titles, abstracts, or even full text with free words has some advantages. Free language is direct and concise, and scientific and technical personnel accept it readily.
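The preference order described above, standardized words first and free words as a fallback, can be sketched as a simple lookup. The miniature thesaurus and the function name `to_search_identifier` are hypothetical and exist only for illustration. A real thesaurus also records broader, narrower, and related terms.

```python
# Hypothetical miniature thesaurus: entry term -> preferred (standardized) term.
THESAURUS = {
    "computer retrieval": "information retrieval",
    "document retrieval": "information retrieval",
    "information retrieval": "information retrieval",
}


def to_search_identifier(term, thesaurus=THESAURUS):
    """Return (identifier, kind): the preferred term if the thesaurus has
    one, otherwise the searcher's own wording as a free word."""
    key = " ".join(term.lower().split())  # normalize case and spacing
    if key in thesaurus:
        return thesaurus[key], "standardized"
    return key, "free"


for t in ["Computer  Retrieval", "fuzzy matching"]:
    print(to_search_identifier(t))
```

Tagging each identifier with its kind lets a later step decide which fields to search, for example descriptor fields for standardized words and title or abstract fields for free words.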
(4) Draw up the search expression and determine the specific search procedure.
The search expression is the logical expression that states the user's search question in computer information retrieval. It is composed of search terms, Boolean logic operators, positional operators, and the other combining and connecting symbols specified by the system. In a sense, the search expression is the concrete embodiment of the retrieval strategy, and its quality bears directly on the success or failure of that strategy.
After the search identifiers are determined, the next step is to connect them in a definite combination so that they form a search expression that captures the various complex conceptual relationships and states the information need accurately. Attention should be paid to the use of logical operators, positional operators, and truncation operators, for example how closely a positional operator binds its terms and whether it fixes their order, the restrictions placed on each search term and the order in which terms are entered, and the adjustment of the search expression according to feedback. See the retrieval strategy section below.
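As a rough illustration of how search terms are combined into one expression, the sketch below joins the alternative terms inside each concept group with OR and joins the groups with AND. The function name `build_expression` and the generic `AND`/`OR` syntax with quoted phrases are assumptions. Real systems define their own operators and command syntax, so the output would need adapting to the system being searched.

```python
def build_expression(concept_groups):
    """Join the terms inside each concept group with OR, then join the
    groups with AND. Multi-word terms are quoted as phrases."""
    def fmt(term):
        return f'"{term}"' if " " in term else term

    parts = []
    for group in concept_groups:
        terms = [fmt(t) for t in group if t]
        if not terms:
            continue  # skip empty groups
        joined = " OR ".join(terms)
        # Parenthesize only when a group has more than one term.
        parts.append(f"({joined})" if len(terms) > 1 else joined)
    return " AND ".join(parts)


# Hypothetical concept groups for one search topic.
groups = [
    ["patent", "invention"],
    ["information retrieval", "online search"],
    ["database"],
]
print(build_expression(groups))
```

Keeping the concept groups as data rather than as a finished string makes it easy to add or drop a term and regenerate the expression when the first results come back.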
2. Search strategy
(1) The concept of retrieval strategy
A retrieval strategy consists of determining the retrieval system, the database to be searched, the retrieval approaches, and the search terms, and of arranging the positional relationships, logical relationships, and search steps among the search terms in a systematic way, all on the basis of an analysis of the conceptual units of the subject content. How thoroughly the retrieval strategy is thought through directly affects the recall and precision of the search.
(2) Steps to formulate retrieval strategy
① The premise of formulating a retrieval strategy is a clear understanding of the basic performance of the database and of the retrieval system as a whole. Databases differ in content, indexing methods, and retrieval approaches, and retrieval systems differ in technical performance and in the operators they provide. Before formulating the strategy, find out which retrieval approaches the database offers and which rules its indexes follow for each of them. If the search expression uses access points that the system does not provide, no documents can be retrieved.
② The basis of formulating a retrieval strategy is to establish the content requirements and the purpose of the search topic. On this basis, the concepts of the topic can be analyzed. If the topic is a single concept, a single search term can express it. If the topic is complex, decompose it into several concept units and then combine the search terms that express those units with logical operators. When converting concept units into search terms, choose standardized words wherever possible. Take particular care with new topics, interdisciplinary subjects, or vague concepts, because the relevant terms are often missing from the system's vocabulary. In such cases, choose keywords that carry retrieval significance within the discipline, that is, free words. Otherwise the search will produce false hits or miss relevant documents.
③ The key to composing a retrieval strategy is to choose terms correctly and combine them with the right logical operators.
④ Adjust the retrieval strategy. In computer retrieval, a search often returns too few documents, even none, or too many. The searcher should analyze the results together with the user and adjust the retrieval strategy promptly until the results are satisfactory. When there are too many or too few hits, the search scope can be narrowed or broadened by adding or removing search terms and by changing the Boolean operators that join them. Generally speaking, logical AND narrows the search scope and serves precision, while logical OR broadens the search scope and serves recall. Logical NOT, which excludes documents, also narrows the search scope and serves precision.
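The effect of the Boolean operators on the number of hits can be seen with ordinary set operations. The posting lists below are made-up data, and the variable names are chosen only for this sketch.

```python
# Hypothetical posting lists: search term -> set of matching document ids.
POSTINGS = {
    "retrieval": {1, 2, 3, 4, 5},
    "patent": {3, 4, 5, 6},
    "chemistry": {5, 6, 7},
}

a, b, c = POSTINGS["retrieval"], POSTINGS["patent"], POSTINGS["chemistry"]

narrowed = a & b        # retrieval AND patent: never more hits than either term
broadened = a | b       # retrieval OR patent: never fewer hits than either term
excluded = (a & b) - c  # (retrieval AND patent) NOT chemistry: narrows further

for label, hits in [("AND", narrowed), ("OR", broadened), ("NOT", excluded)]:
    print(label, len(hits), sorted(hits))
```

A searcher who gets too many hits can add an AND or NOT term, and one who gets too few can add synonyms with OR, then compare the hit counts before printing any records.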
3. Retrieval efficiency
Retrieval efficiency is the effective result obtained when a retrieval system (or tool) is used to provide a retrieval service. It directly reflects the performance of the retrieval system and affects both the system's competitiveness in the information market and the interests of its users. Retrieval efficiency has two aspects: technical effect and socio-economic effect. Technical effect refers mainly to the performance and service quality of the system and the degree to which it meets users' information needs. Socio-economic effect refers to how economically and effectively the system meets users' needs, so that users or the system itself obtain some socio-economic benefit. The discussion below concerns mainly the evaluation of technical effect.
In the ideal search, recall and precision are both 100%: every relevant document held in the database is retrieved, and every retrieved document is relevant. In practice, many factors make this target hard to reach, and some errors always occur. Two further indicators measure these errors: the missed detection rate and the false detection rate.
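The four indicators can be computed from two sets of document ids. The sketch below assumes the usual definitions, which the text does not spell out: recall is the share of relevant documents that were retrieved, precision is the share of retrieved documents that are relevant, the missed detection rate is the complement of recall, and the false detection rate is the complement of precision. The function name `evaluate` and the sample numbers are hypothetical.

```python
def evaluate(retrieved, relevant):
    """Compute recall, precision and their complements for one search.

    retrieved: ids of the documents the search returned
    relevant:  ids of all relevant documents in the database
    An empty set yields 0.0 for the corresponding ratio in this sketch.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "miss_rate": 1.0 - recall,                # assumed: complement of recall
        "false_detection_rate": 1.0 - precision,  # assumed: complement of precision
    }


# Hypothetical search: 10 documents returned, 6 of them relevant,
# out of 12 relevant documents in the database.
result = evaluate(
    retrieved=range(1, 11),
    relevant=[1, 2, 3, 4, 5, 6, 21, 22, 23, 24, 25, 26],
)
print(result)
```

In practice the full set of relevant documents in a database is rarely known, so recall is usually estimated from a sample or a known test collection.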
In evaluation work, recall and precision are the most commonly used indicators, and they should be used together. Either one alone gives little indication of how well the retrieval system works or how good the search results are. The two tend to be inversely related: as recall rises, precision falls, and vice versa. In computer retrieval, precision is generally thought to lie at about 60-70% and recall at about 40-60%.
The coverage of the system, the indexing language, and the indexing and searching work are all factors that affect recall and precision. They are not discussed again here.
4. Measures to improve retrieval efficiency
(1) Improve the compilation quality of the document database: make its coverage more comprehensive so that it meets the needs of the corresponding disciplines or specialties, and make its descriptions of content more detailed and accurate.
(2) Improve indexing quality: index consistently, choose terms properly, and combine them reasonably. The aims are to reveal the subject correctly, to represent the subject fully without omitting index terms, and to use identifiers concisely without over-indexing.
(3) Improve the specificity of the indexing language and the quality of the thesaurus. Strengthen control of the indexing vocabulary and improve the structure of the thesaurus and its cross-reference relationships, so that the indexing language serves both generic (class-level) and specific searching. The vocabulary structure should be complete, the relationships between terms should be correct, synonyms and polysemous words should be handled correctly, and the terms of new disciplines and technologies should be added promptly.
(4) Improve the skill and ability of searchers. They should know what the database covers, understand the structure of the thesaurus in depth, analyze the subject correctly, choose an appropriate database, choose appropriate search terms to express the subject content, combine them logically in an appropriate way, and find the best retrieval approach, and thus formulate the best retrieval strategy.
(5) Adjust recall and precision.
In actual retrieval, recall and precision can be adjusted according to the requirements of each search so that the results meet those requirements as fully as possible. Sometimes high recall is required so that no relevant literature is omitted, and lower precision has to be accepted. At other times the user only wants to browse some new and important articles rather than all of them, which calls for higher precision and lower recall. In short, recall and precision should be balanced sensibly during the search to achieve the best retrieval results.
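A small made-up example shows how the choice of search expression shifts the balance between the two measures. The document ids, the two posting lists, and the helper `recall_precision` are assumptions for illustration only. The broad search joins the terms with OR and the narrow search joins them with AND.

```python
# Hypothetical data: ids of relevant documents and posting lists for two terms.
RELEVANT = {1, 2, 3, 4, 5, 6}
TERM_A = {1, 2, 3, 4, 7, 8}
TERM_B = {3, 4, 5, 9, 10, 11}


def recall_precision(hits, relevant):
    """Return (recall, precision) for one result set."""
    found = len(hits & relevant)
    recall = found / len(relevant) if relevant else 0.0
    precision = found / len(hits) if hits else 0.0
    return recall, precision


broad = TERM_A | TERM_B   # OR search, aimed at high recall
narrow = TERM_A & TERM_B  # AND search, aimed at high precision

print("broad ", recall_precision(broad, RELEVANT))
print("narrow", recall_precision(narrow, RELEVANT))
```

With this data the broad search finds more of the relevant documents but returns more irrelevant ones, while the narrow search returns only relevant documents but misses most of them.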