Search engine classification
Search engines can be divided into three types according to how they work: full-text search engines, directory indexes, and meta-search engines.
Full-text search engine
Full-text search engines are search engines in the true sense. Representative foreign examples include Google, Fast/AllTheWeb, AltaVista, Inktomi, Teoma, and WiseNut; the best-known domestic example is Baidu. They all work by extracting information from sites across the Internet (mainly web pages) to build their databases, retrieving the records that match a user's query, and returning the results to the user in ranked order.
Judged by the source of their results, full-text search engines can be subdivided into two types. One type runs its own indexer, commonly known as a "spider" or "robot" program, and builds its own web database; its results are served directly from that database, as with the seven engines mentioned above. The other type rents another engine's database and presents the results in its own format, as the Lycos engine does.
Directory index
Although a directory index offers a search function, it is not a search engine in the strict sense; it is a list of website links organized by category. Users need not enter keywords at all: they can find the information they want by drilling down through the classified directory. The most representative directory index is the famous Yahoo!. Other well-known examples include the Open Directory Project (DMOZ), LookSmart, and About. Domestically, the Sohu, Sina, and Netease search services also belong to this category.
Meta-search engine
When a meta-search engine receives a user's query, it searches several other engines simultaneously and returns the combined results to the user. Well-known meta-search engines include InfoSpace, Dogpile, and Vivisimo, among others; Chinese-language meta-search engines exist as well. In ranking search results, some, such as Dogpile, simply group results by source engine, while others, such as Vivisimo, rearrange the results according to their own rules.
In addition to the above three types of engines, there are the following non-mainstream forms:
Aggregation search engine: for example, the engine HotBot launched at the end of 2002. It resembles a meta-search engine, but instead of querying multiple engines simultaneously, it has the user choose among the four engines it provides, so "aggregation" search engine is the more accurate name.
Portal search engines: AOL Search and MSN Search provide search services but maintain neither a classified directory nor a web database of their own; their results come entirely from other engines.
FFA (Free For All Links) pages: such sites generally offer nothing more than a simple scrolling list of links, and a few have rudimentary classified directories, but their scale is far smaller than that of directory indexes like Yahoo!.
Because the above websites provide search and query services for users, we usually call them search engines for convenience.
The basic working principles of search engines
Understanding how search engines work is a great help both in everyday searching and in submitting and promoting a website.
Full-text search engine
In the section on classification, we mentioned that full-text search engines extract information from websites to build their web databases. Search engines collect information automatically in two ways. The first is periodic crawling: every so often (for Google, roughly every 28 days), the engine dispatches its "spider" program to crawl Internet sites within a certain range of IP addresses; whenever it discovers a new website, it automatically extracts the site's information and URL and adds them to its database.
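To make the crawl cycle concrete, here is a minimal Python sketch of the "spider" loop just described: fetch a page, record it in the web database, extract its links, and queue any newly discovered URLs. Everything here (the `fetch` callable, the page limit, the in-memory dictionary standing in for the web database) is an illustrative assumption, not any real engine's implementation.

```python
# A sketch of the "spider" cycle: fetch, store, extract links, queue new URLs.
# `fetch` is an assumed callable that returns a page's HTML as a string.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href targets of anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl starting from seed_urls."""
    database = {}                  # url -> raw HTML: the engine's "web database"
    frontier = list(seed_urls)
    seen = set(frontier)
    while frontier and len(database) < max_pages:
        url = frontier.pop(0)
        database[url] = fetch(url)
        parser = LinkExtractor()
        parser.feed(database[url])
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen:   # a newly discovered page
                seen.add(absolute)
                frontier.append(absolute)
    return database

# Usage with a fake two-page "web":
pages = {"http://x.example/": '<a href="/a">A</a>', "http://x.example/a": ""}
print(sorted(crawl(["http://x.example/"], fetch=lambda u: pages.get(u, ""))))
```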
The second is site submission: the website owner voluntarily submits the site's address to the search engine, which then sends a "spider" to the site within a certain period (anywhere from two days to several months), scans it, and stores the relevant information in its database for users to query. Because search engines' indexing rules have changed considerably in recent years, actively submitting a URL no longer guarantees that your site will enter a search engine's database. The best approach today is to acquire more external links, giving search engines more opportunities to find your site and include it automatically.
When a user searches by keyword, the engine looks through its database. For the pages that satisfy the user's query, it applies a dedicated algorithm, usually based on how well the keywords match, where and how often they occur on the page, link quality, and so on, to compute each page's relevance and rank, and then returns the page links to the user in order of relevance.
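As a rough illustration of how such factors might be combined, here is a hypothetical scoring function in the same spirit: keyword frequency, position (a title match weighs more), and an inbound-link count as a crude stand-in for link quality. The weights and example pages are invented for this sketch; real engines' formulas are proprietary.

```python
# Hypothetical relevance scoring combining the factors listed above.
# The weights (1.0, 5.0, 0.5) are invented for illustration.
def relevance_score(page_text, title, keywords, inbound_links):
    text, title = page_text.lower(), title.lower()
    score = 0.0
    for kw in (k.lower() for k in keywords):
        score += 1.0 * text.count(kw)   # frequency of occurrence on the page
        if kw in title:
            score += 5.0                # position: a title match weighs more
    score += 0.5 * inbound_links        # crude proxy for link quality
    return score

pages = [
    ("http://a.example", "Search engines explained", "search engines rank pages", 40),
    ("http://b.example", "Cooking tips", "search once for recipes", 3),
]
for url, title, text, links in sorted(
        pages, key=lambda p: relevance_score(p[2], p[1], ["search"], p[3]),
        reverse=True):
    print(url)
```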
Directory index
Compared with full-text search engines, directory indexes differ in several important respects.
First, a full-text search engine indexes sites automatically, whereas a directory index depends entirely on manual work. After you submit your site, a directory editor personally reviews it and then decides whether to accept it, according to a set of evaluation criteria or even the editor's subjective impression.
Second, when a search engine includes a website, the listing will generally succeed as long as the site itself does not violate the relevant rules. A directory index demands much more of a site, and you may submit many times without success. Super-indexes like Yahoo! are especially hard to get into. (Because Yahoo! is the hardest to enter and is a key battleground for online marketing, we will cover the techniques for getting listed in Yahoo! separately later.)
In addition, when submitting to a search engine we generally need not consider how the site is classified, but when submitting to a directory index we must place the site in the most appropriate category.
Finally, in a search engine each site's descriptive information is extracted automatically from its own pages, so from the site owner's point of view we have more autonomy. A directory index, by contrast, requires you to fill in the site information manually and imposes various restrictions. What's more, if the staff decide that the category or site description you submitted is inappropriate, they can adjust it at any time, naturally without consulting you first.
A directory index, as the name implies, stores websites under corresponding categories. When querying, users can either search by keyword or drill down through the classified directory. With keyword search, the results come back just as from a search engine, ranked by information relevance, though with a larger human element. When browsing by category, a site's position within a directory is determined by the alphabetical order of its title (with some exceptions).
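A toy model of the two query modes, assuming the structure just described: sites live under categories, category browsing lists them alphabetically by title, and keyword search scans the whole directory. The categories and sites below are illustrative.

```python
# A toy directory index: sites stored under categories, browsed alphabetically
# or searched by keyword. Categories and sites are illustrative.
directory = {
    "Computers/Internet/Searching": ["Yahoo!", "AltaVista", "Google"],
    "Computers/Internet/Portals": ["Sohu", "Sina", "Netease"],
}

def browse(category):
    """Hierarchical browsing: sites listed in title order, as noted above."""
    return sorted(directory.get(category, []))

def keyword_search(term):
    """Keyword search across every category in the directory."""
    term = term.lower()
    return [(cat, site) for cat, sites in directory.items()
            for site in sites
            if term in site.lower() or term in cat.lower()]

print(browse("Computers/Internet/Searching"))  # ['AltaVista', 'Google', 'Yahoo!']
print(keyword_search("sina"))                  # [('Computers/Internet/Portals', 'Sina')]
```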
At present, search engines and directory indexes are tending to merge and interpenetrate. Some pure full-text engines now also offer directory search: Google, for example, borrows the Open Directory to provide classified browsing. Meanwhile, veteran directory indexes such as Yahoo! have extended their search coverage by partnering with engines such as Google. In their default search mode, some directory-based engines return the matching sites from their own directories first, as Sohu, Sina, and Netease do in China, while others, such as Yahoo!, default to web search.
The third law of search engines
Today it is time for search engines to close out the past and open up the future. To explain what I mean by the third law, let me first review the first and second laws.
The first law: the law of relevance
This sounds like an academic paper. Indeed, neither the first law nor the second law has been stated in these terms before, but their substance has long been recognized in both industry and academia. The first law was studied extensively in academia long before the Internet appeared: it is the so-called law of relevance. The field was then known as information retrieval, and sometimes as full-text retrieval.
Relevance at that time was based on word-frequency statistics: the user entered search terms, and the engine looked for pages in which those terms appeared frequently and in important positions, added weights reflecting how common the search terms themselves were, and produced a ranked result (the search results page). Early search engines such as Infoseek, Excite, and Lycos ranked results according to this first law. They essentially followed the academic research of the pre-Internet era; the industry concentrated on handling heavy traffic and large volumes of data, and relevance ranking itself saw no breakthrough.
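In code, the first law amounts to roughly the classic TF-IDF score from pre-web information retrieval: term frequency on the page, weighted by how rare the term is across the collection (the "commonness weight" mentioned above). A minimal sketch with made-up documents:

```python
# First-law ranking in miniature: term frequency times a rarity weight
# (TF-IDF). Documents are made up for the example.
import math

def tf_idf_rank(query_terms, documents):
    n = len(documents)
    ranked = []
    for doc in documents:
        words = doc.lower().split()
        score = 0.0
        for term in (t.lower() for t in query_terms):
            tf = words.count(term)                       # frequency in this doc
            df = sum(1 for d in documents if term in d.lower().split())
            if tf:
                score += tf * math.log(n / df)           # rarer terms weigh more
        ranked.append((score, doc))
    return sorted(ranked, reverse=True)

docs = ["search engines rank web pages",
        "web pages about cooking",
        "search search search keyword stuffing"]
for score, doc in tf_idf_rank(["search"], docs):
    print(round(score, 2), doc)
```

Note that the keyword-stuffed document ranks first, which is precisely the weakness discussed next.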
In fact, word-frequency statistics makes no use of any network-specific features at all; it is a technology of the pre-web era. The principal documents of the Internet age are web pages, and almost anyone can publish almost any content online at will. Two pages with identical word frequencies can differ enormously in quality, yet under the first law they would rank the same. To rank high in certain search results, many web content producers racked their brains to pile keywords onto their pages, leaving search engines hard pressed to defend themselves. This situation began to change in 1996.
The second law: the popularity-quality law
In April 1996 I went to Las Vegas for an academic conference on information retrieval. The proceedings were as dull as the Las Vegas weather, but being far from the company gave me a rare chance to calm down and think seriously. While listening to an unimportant paper presentation, I suddenly connected the mechanism of the science citation index with the hyperlinks on web pages. For that I thank Peking University, which taught me the science citation index in my junior year; I doubt any university in the United States teaches it to undergraduates.
Put plainly, the science citation index mechanism holds that whoever is cited more often is an authority, and a much-cited paper is a good paper. Transplanted to the Internet, the idea becomes: whichever page has more links pointing to it is judged to be of higher quality and more popular. Combined with analysis of the corresponding link text, this can be used to rank search results. Hence the second law of search engines: the popularity-quality law. Under this law, the relevance ranking of search results no longer depends entirely on word-frequency statistics but relies more heavily on hyperlink analysis.
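Here is a minimal sketch of ranking by hyperlink analysis: a simplified PageRank-style iteration in which a page's score grows when well-scored pages link to it. The link graph, damping factor, and iteration count are illustrative assumptions, not the patented method itself.

```python
# Simplified PageRank-style hyperlink analysis: a page's score grows when
# well-scored pages link to it. Graph and damping factor are illustrative.
def link_rank(links, iterations=20, damping=0.85):
    """`links` maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, targets in links.items():
            for t in targets:
                # each link passes on a share of the linking page's score
                new_rank[t] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
for page, score in sorted(link_rank(graph).items(), key=lambda x: -x[1]):
    print(page, round(score, 3))
```

Page "c", with the most inbound links, ends up with the highest score, even though word frequency plays no part at all.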
I realized this was a breakthrough, and I summarized my ideas as soon as I got back. In June 1996 I filed a US patent application in this field. On July 6, 1999, the United States Patent and Trademark Office granted patent No. 5,920,859 with me as the sole inventor. Around the end of 1996, two graduate students in the computer science department at Stanford University arrived at the same solution; they later created a search engine called Google. Google's website still says that their technology is patent pending, and I wonder whether the US Patent Office will grant such a patent a second time. In any case, hyperlink analysis has been gradually adopted by the major search engines since 1998. Because the link is a fundamental feature of web content, search engines from this point on finally began to use retrieval technology native to the network era.
But anything can happen. Since 2000 the Internet bubble has burst rapidly; the major search engines have either been acquired or failed to go public, and none of the companies applying the popularity-quality law has been spared. So where is the way out for search engines?
The third law: the law of self-confidence
The popularity-quality law, too, solved only a technical problem. Yet from its birth the search engine has never been a purely technical phenomenon; it combines technology, culture, the market, and many other factors. To solve search engine companies' problems of survival and growth, we need the third law of search engines: the law of self-confidence.
In 1998, few people took seriously a newly founded company called GoTo.com (since renamed Overture), 500 miles from Silicon Valley. It simply bought search technology as a service from another engine, then auctioned off positions in GoTo's search results to website owners: whoever paid more ranked first, with payment calculated by the number of times users clicked through to the site, and sites that paid nothing listed after those that did. This was the earliest practitioner of the law of self-confidence! Under this law, the relevance ranking of search results depends not only on word-frequency statistics and hyperlink analysis but, above all, on an auction: whoever has confidence in his own website ranks first, and the mark of that confidence is the willingness to pay for the ranking. I should state that "the law of self-confidence" is likewise my own name for this model; no one has summarized it in the prior literature.
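The mechanics of the model are simple enough to sketch directly: advertisers bid on a keyword, listings are ordered by bid, and an advertiser is charged its bid only when a user actually clicks. The sites, bids, and balances below are invented for illustration.

```python
# Bid-for-ranking with pay-per-click, as GoTo pioneered it. The advertisers,
# bids, and account balances are invented for this sketch.
bids = {                       # advertiser -> bid per click, in dollars
    "shopA.example": 0.50,
    "shopB.example": 0.75,
    "shopC.example": 0.30,
}
balances = {site: 10.00 for site in bids}   # prepaid advertiser accounts

def ranked_listings():
    """Whoever bids more ranks first."""
    return sorted(bids, key=bids.get, reverse=True)

def record_click(site):
    """The advertiser pays its bid only when a user actually clicks."""
    balances[site] -= bids[site]

print(ranked_listings())          # ['shopB.example', 'shopA.example', 'shopC.example']
record_click("shopB.example")
print(round(balances["shopB.example"], 2))   # 9.25
```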
Today, with the Internet industry in recession and Nasdaq slumping, GoTo is thriving, with a market value of over a billion dollars and revenue equal to 35% of Yahoo!'s total revenue. Which portal, by contrast, can draw one-third of its total revenue from its search engine services? The reason is that GoTo was the first to practice the self-confidence law of search engines. Search engines used to charge by CPM, a model borrowed from traditional advertising that ignores the characteristics of online media, such as immediacy, interactivity, and the ease of running an auction. Bid-for-ranking with pay-per-click, by contrast, delivers sales leads directly to website owners rather than advertising in the traditional sense. The law of self-confidence has changed the awkward situation in which search engines depended on CPM for revenue, and has created a charging model that truly belongs to the Internet.