The position and function of search engines in network information retrieval: the relationship between search engines and information retrieval

Information retrieval is not the same thing as a search engine.

The development of the Internet has greatly promoted the development and application of information retrieval technology, and a large number of search engine products have emerged, giving netizens a good tool for quickly obtaining information and navigating the web. However, it is a misconception to equate information retrieval with the use of search engines. Although full-text retrieval technology is widely used inside search engines, Internet-scale search and enterprise information search differ in several respects.

The first is the amount of data. The index databases of traditional information retrieval systems are mostly at the gigabyte level, while Internet web search must process tens of millions of web pages. The basic strategy of search engines is to use clusters of search servers, which is inappropriate and unnecessary for most enterprise applications.

The second is content relevance. With so much information available, finding and ranking it well is particularly important. Search engines such as Google have developed link analysis techniques that rank pages based on the number of hyperlinks pointing to them across the Internet. Within an enterprise website, however, page links are determined by the content editing and publishing system; their number is incidental and cannot serve as a measure of importance. Real enterprise search must therefore rank by content relevance, that is, place the information most relevant to the query at the top of the results, which link analysis alone basically cannot achieve.
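Content-based relevance ranking of the kind described here is commonly implemented with term weighting such as TF-IDF. The sketch below is illustrative only (the article does not specify a formula); the document format and function names are assumptions:

```python
import math
from collections import Counter

def tf_idf_scores(query_terms, documents):
    """Rank documents by a simple TF-IDF relevance score.

    `documents` maps a doc id to a list of tokens; `query_terms`
    is a list of tokens. All names here are illustrative.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in documents.values():
        for term in set(tokens):
            df[term] += 1
    scores = {}
    for doc_id, tokens in documents.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term in tf:
                # Terms that are rare across the collection weigh more.
                idf = math.log(n_docs / df[term])
                score += (tf[term] / len(tokens)) * idf
        scores[doc_id] = score
    # Most relevant documents first.
    return sorted(scores, key=scores.get, reverse=True)
```

Unlike link analysis, this scoring uses only the documents' own content, which is why it remains applicable inside an enterprise site where link counts carry no signal.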

The third is real-time behavior. In search engines, index generation and retrieval services are separated, and data is updated and synchronized on a schedule; the update cycle of a large search engine is measured in weeks or even months. Enterprise information retrieval, by contrast, must reflect changes in internal and external information in real time, and the search engine mechanism cannot meet the requirements of dynamically growing and changing enterprise data.

The fourth is security. Internet search engines operate over file systems, whereas the content of enterprise applications is generally stored centrally and securely in data warehouses to satisfy data-security and management requirements.

The fifth is personalization and intelligence. Limited by the scale of their data and user base, search engines find it hard to apply computation-intensive intelligent techniques such as relevance feedback, knowledge retrieval, and knowledge mining, whereas information retrieval applications built specifically for enterprises can go further in intelligence and personalization.

Information retrieval usually refers to text information retrieval, covering the storage, organization, presentation, querying, and access of information; its core is the indexing and retrieval of text. Historically, information retrieval has passed through several stages of development, from manual retrieval to computer retrieval and on to networked and intelligent retrieval.

At present, information retrieval has entered the networked and intelligent stage. Its objects have expanded from the relatively closed, stable, and consistent content managed by standalone databases to open, dynamic, rapidly updated, widely distributed, and loosely managed Web content. Its users have likewise expanded from information professionals to the general public, including business people, managers, teachers and students, and other professionals, who place higher and more diverse demands on retrieval, from results to methods. Adapting to networking, intelligence, and personalization is the new trend in the development of information retrieval technology.

Hot spots of information retrieval technology

◆ Intelligent retrieval or knowledge retrieval?

Traditional full-text retrieval is based on keyword matching, which often suffers from incomplete results, inaccurate results, and low retrieval quality; in the networked information age in particular, keyword matching alone struggles to meet people's retrieval needs. Intelligent retrieval improves results by using word-segmentation dictionaries, synonym dictionaries, and homophone dictionaries: a query for one term can also retrieve information phrased with its synonyms. Going further, it can assist querying at the knowledge or concept level, building a knowledge system or concept network from subject dictionaries, broader/narrower-term dictionaries, and related-term dictionaries, and offering users intelligent suggestions that help them reach the best results. For example, a user querying "computer" can narrow the scope to "microcomputer" or "server", or broaden it to "information technology" or the related fields "electronic technology", "software", and "computer applications". Intelligent retrieval also handles ambiguity: whether "apple" refers to the fruit or the computer brand, or how an ambiguous character string should be segmented into words, is resolved by combining an ambiguity knowledge base, full-text indexing, analysis of the user's retrieval context, and user relevance feedback, so that the information the user most needs is returned efficiently and accurately.
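The dictionary-based synonym expansion described above can be sketched very simply. The synonym dictionary and document format below are hypothetical placeholders, not data from the article:

```python
def expand_query(terms, synonyms):
    """Expand each query term with its synonyms before matching.

    `synonyms` is a hypothetical dictionary mapping a term to a set
    of equivalent terms, standing in for the synonym dictionaries
    an intelligent retrieval system would maintain.
    """
    expanded = set()
    for term in terms:
        expanded.add(term)
        expanded.update(synonyms.get(term, ()))
    return expanded

def search(query_terms, documents, synonyms):
    """Return ids of documents containing any expanded query term."""
    terms = expand_query(query_terms, synonyms)
    return [doc_id for doc_id, tokens in documents.items()
            if terms & set(tokens)]
```

A query for "computer" would then also match documents that only use a synonym such as "microcomputer", which is the effect the article attributes to dictionary-assisted retrieval.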

◆ Knowledge mining

At present this mainly means text mining, which aims to help people find, organize, and express information better and to extract knowledge, meeting the higher-level demands of information retrieval. Knowledge mining includes automatic summarization, classification (and clustering), and similarity retrieval.

Automatic summarization uses the computer to extract an abstract from the original document. In retrieval, summaries help users quickly judge the relevance of results; in information services, they help distribute content in various forms, such as to PDAs and mobile phones. Similarity retrieval finds similar or related documents based on their content features; it underlies personalized relevance feedback and can also be used for duplicate detection. Automatic classification can be statistical or rule-based, learning a predefined classification tree and then assigning documents to it by their content features, while automatic clustering groups and merges documents by the relatedness of their content. Both are very useful for organizing and navigating information.
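One simple family of automatic summarization methods is extractive: score each sentence and keep the highest-scoring ones. The frequency-based scoring below is a minimal sketch of that idea, not the method of any particular system mentioned here:

```python
from collections import Counter

def extractive_summary(sentences, n=2):
    """Pick the n highest-scoring sentences as a crude extractive summary.

    Each sentence is scored by the total corpus frequency of its words;
    selected sentences are returned in their original order. A real
    summarizer would also drop stop words and normalize by length.
    """
    words = [w for s in sentences for w in s.lower().split()]
    freq = Counter(words)
    # Rank sentence indices by summed word frequency, highest first.
    scored = sorted(
        range(len(sentences)),
        key=lambda i: sum(freq[w] for w in sentences[i].lower().split()),
        reverse=True)
    # Restore document order among the chosen sentences.
    keep = sorted(scored[:n])
    return [sentences[i] for i in keep]
```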

◆ Integrated retrieval and holographic retrieval of heterogeneous information

As information retrieval becomes distributed and networked, demands on the openness and integration of retrieval systems keep growing: a system must be able to retrieve and integrate information from different sources and in different structures. This is the starting point of heterogeneous information retrieval, which includes support for files in many formats, such as TEXT, HTML, XML, RTF, MS Office, PDF, PS2/PS, MARC, and ISO2709; support for multilingual retrieval; support for the unified handling of structured, semi-structured, and unstructured data; and other open retrieval interfaces. The so-called "holographic retrieval" concept means supporting retrieval across all formats and modes. Judging from current practice, reaching the level of integrated retrieval over heterogeneous information still requires further breakthroughs in human-computer interaction based on natural-language understanding and in integrated multimedia retrieval.

In addition, from an engineering standpoint, the combined use of multi-level caching across memory and disk, distributed clustering, and load balancing are also important aspects of the development of information retrieval technology.

With the popularity of the Internet and the growth of e-commerce, the amount of information that enterprises and individuals can obtain and must process has exploded, most of it unstructured or semi-structured data, making content management ever more important. As the core supporting technology of content management, information retrieval will spread into many fields along with content management itself, becoming a close companion in people's daily work and life.

Information retrieval originated in library reference services and abstracting and indexing. It first developed in the second half of the 19th century, and by the 1940s indexing and retrieval had become an independent tool and user service of libraries.

With the advent of the world's first electronic computer in 1946, computer technology gradually entered the field of information retrieval and was closely combined with retrieval theory; offline batch retrieval systems and online real-time retrieval systems were developed and commercialized. From the 1960s to the 1980s, driven by advances in information processing, communications, computing, and database technology, information retrieval developed rapidly in education, the military, and commerce and was widely applied. The Dialog international online retrieval system was the representative system of this period and remains one of the most famous in the world.

Search engine workflow

The Internet is a treasure house, and the search engine is a key to it. However, the vast majority of netizens lack knowledge and skill in using search engines. One overseas survey found that about 71% of people are disappointed to some degree with their search results. For the Internet's second-largest service, this situation ought to change.

The rapid development of the Internet has caused explosive growth of online information. There are now more than 2 billion web pages worldwide, with 7.3 million added every day. Finding information in such a vast ocean is as hard as finding a needle in a haystack. Search engines are precisely the technology for solving this "getting lost" problem.

The work of a search engine comprises the following three processes:

1. Discovering and collecting web page information on the Internet;

2. Extracting and organizing that information to build an index database;

3. Using the query keywords entered by the user, quickly locating documents in the index database, evaluating the relevance of each document to the query, sorting the results, and returning them to the user.
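Steps 2 and 3 are usually realized with an inverted index: a mapping from each term to the set of documents containing it, which makes keyword lookup fast. A minimal sketch, with an illustrative document format (the article does not specify one):

```python
from collections import defaultdict

def build_index(documents):
    """Build an inverted index: term -> set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def query(index, terms):
    """Return doc ids containing all query terms, via set intersection."""
    postings = [index.get(t.lower(), set()) for t in terms]
    if not postings:
        return set()
    result = postings[0]
    for p in postings[1:]:
        result = result & p
    return result
```

A production retriever would additionally score and sort the matching documents by relevance, as step 3 describes, rather than returning an unordered set.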

Discover and collect network information

A high-performance "spider" program is needed to search the Internet automatically. A typical web spider works by fetching a page and extracting relevant information from it, then following every link on that page and repeating the process until the links are exhausted. A spider must be both fast and comprehensive. To traverse the Internet quickly, web spiders usually use preemptive multithreading: starting from a URL, the spider indexes the page and launches a new thread to follow each new URL link it finds, each becoming a new indexing starting point. Of course, the number of threads open against a server cannot grow without bound, so a balance must be found between keeping servers running normally and collecting pages quickly. Each search engine company's algorithm may differ, but the goal is the same: to traverse the web quickly and feed the subsequent processing stages. Domestic search engine companies, for example Baidu with its web spider, use customizable and highly extensible scheduling algorithms that let the crawler collect the largest possible amount of Internet information in a very short time and save it for index construction and user retrieval.
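The multithreaded crawl described above can be sketched with a shared work queue and a pool of worker threads. This is a toy model, not any vendor's algorithm; `fetch_links` is a hypothetical stand-in for downloading a page and extracting its links:

```python
import threading
import queue

def crawl(start_url, fetch_links, max_pages=100, num_threads=4):
    """Breadth-style crawl using a shared frontier queue and workers.

    `fetch_links(url)` stands in for fetching a page and returning its
    outgoing links; a real spider would fetch over HTTP, obey
    robots.txt, and bound connections per server.
    """
    frontier = queue.Queue()
    frontier.put(start_url)
    seen = {start_url}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                # Workers exit once the frontier stays empty briefly.
                url = frontier.get(timeout=0.1)
            except queue.Empty:
                return
            for link in fetch_links(url):
                with lock:
                    if link not in seen and len(seen) < max_pages:
                        seen.add(link)
                        frontier.put(link)
            frontier.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen

```

The `max_pages` cap plays the role of the balance the article mentions: the thread pool and frontier cannot be allowed to grow without limit.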

Establishment of the index database

The index database determines whether users can find the most accurate and extensive information as quickly as possible, so it must be built fast, indexing the pages crawled by the web spiders very quickly to keep the information timely. Evaluating page relevance with a combination of page-content analysis and hyperlink analysis makes it possible to rank pages objectively, which greatly helps ensure that search results match the user's query string. When indexing site data, the Sina search engine, for example, builds its index according to where keywords appear (the site title, site description, site URL, and so on) and the quality grade of the site, again to keep search results consistent with the user's query string.
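Weighting a keyword by where it appears, as described for the Sina engine, can be sketched as field-weighted scoring. The weights and page format below are purely illustrative assumptions, not Sina's actual values:

```python
# Illustrative weights: a term in the title counts more than one
# in the description, which counts more than one in the body.
FIELD_WEIGHTS = {"title": 3.0, "description": 2.0, "body": 1.0}

def field_score(query_terms, page):
    """Score a page by which fields its query terms appear in.

    `page` maps field names ("title", "description", "body") to
    token lists; unknown fields default to body weight.
    """
    score = 0.0
    for field, tokens in page.items():
        weight = FIELD_WEIGHTS.get(field, 1.0)
        for term in query_terms:
            score += weight * tokens.count(term)
    return score
```

With this scheme, a page whose title contains the query term outranks one that only mentions it in the body, which is the intuition behind position-aware indexing.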