(1) Package organization
(2) Simulated login (crawler technical point 1)
Simulated login is a necessary and often difficult step when crawling data from a website that requires authentication, and the Zhihu crawler is a good case study. Implementing a simulated login takes two steps: (1) analyze the login request flow and find the key requests and parameters, using tools such as the IE developer tools (F12), Fiddler, or HttpWatch; (2) write code that replays that login process.
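The two steps above can be sketched with the JDK's built-in HTTP client (the project itself wraps Apache HttpClient). The login URL and form field names below are placeholders, not Zhihu's real ones; a `CookieManager` keeps the session cookie that a successful login sets, so later requests from the same client are authenticated.

```java
import java.net.CookieManager;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class SimulatedLogin {

    // Encode the discovered form fields as application/x-www-form-urlencoded
    public static String formEncode(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        // The CookieManager stores the session cookie from the login response
        HttpClient client = HttpClient.newBuilder()
                .cookieHandler(new CookieManager())
                .build();

        Map<String, String> form = new LinkedHashMap<>();
        form.put("email", "user@example.com");  // hypothetical field names,
        form.put("password", "secret");         // found in step (1) with F12/Fiddler

        HttpRequest login = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/login"))  // placeholder endpoint
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formEncode(form)))
                .build();

        try {
            HttpResponse<String> resp =
                    client.send(login, HttpResponse.BodyHandlers.ofString());
            System.out.println("login status: " + resp.statusCode());
        } catch (Exception e) {
            System.out.println("request failed: " + e.getMessage());
        }
    }
}
```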
(3) Web page download (crawler technical point 2)
After the simulated login, you can download the target page's HTML. The Zhihu crawler implements a network connection thread pool on top of HttpClient and encapsulates the two most common download methods, GET and POST.
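A minimal version of such a downloader might look like the sketch below. It uses the JDK's `java.net.http` client rather than Apache HttpClient (which the project actually wraps), and takes the fetch function as a constructor parameter so the pool logic can be exercised without touching the network.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class PageDownloader {
    private final ExecutorService pool;
    private final Function<String, String> fetcher;

    public PageDownloader(int threads, Function<String, String> fetcher) {
        this.pool = Executors.newFixedThreadPool(threads);
        this.fetcher = fetcher;
    }

    // A real fetcher: plain HTTP GET via the JDK client
    public static String httpGet(String url) {
        try {
            HttpRequest req = HttpRequest.newBuilder().uri(URI.create(url)).GET().build();
            return HttpClient.newHttpClient()
                    .send(req, HttpResponse.BodyHandlers.ofString()).body();
        } catch (Exception e) {
            throw new RuntimeException("download failed: " + url, e);
        }
    }

    // Submit every URL to the pool, then block until all pages are in
    public List<String> downloadAll(List<String> urls) {
        List<Future<String>> futures = new ArrayList<>();
        for (String u : urls) {
            futures.add(pool.submit(() -> fetcher.apply(u)));
        }
        List<String> pages = new ArrayList<>();
        for (Future<String> f : futures) {
            try {
                pages.add(f.get());
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
        return pages;
    }

    public void shutdown() {
        pool.shutdown();
    }
}
```

A real crawler would construct it as `new PageDownloader(8, PageDownloader::httpGet)`; a POST variant would add a body publisher in the same style.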
(4) Automatic detection of the web page encoding (crawler technical point 3)
Automatically detecting the page encoding is the prerequisite for downloading HTML without garbled characters. The method provided by the Zhihu crawler solves the garbling problem for most downloaded pages.
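The project's own method is not reproduced here; one common approach, sketched below, is to read the charset from the HTTP `Content-Type` header, fall back to the page's `<meta>` declaration, and default to UTF-8, then decode the raw bytes with whatever was found.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffer {
    // Matches charset=... in both HTTP headers and HTML <meta> tags
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    public static Charset detect(String contentTypeHeader, byte[] body) {
        if (contentTypeHeader != null) {
            Matcher m = CHARSET.matcher(contentTypeHeader);
            if (m.find()) return Charset.forName(m.group(1));
        }
        // Peek at the first bytes with an ASCII-compatible decoding
        // to look for a <meta charset=...> declaration
        String head = new String(body, 0, Math.min(body.length, 1024),
                StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(head);
        if (m.find()) return Charset.forName(m.group(1));
        return StandardCharsets.UTF_8;  // last-resort default
    }

    // Decode the raw bytes with the detected charset to avoid mojibake
    public static String decode(String contentTypeHeader, byte[] body) {
        return new String(body, detect(contentTypeHeader, body));
    }
}
```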
(5) Web page parsing and extraction (crawler technical point 4)
There are two common ways to parse and extract data from web pages in a Java crawler: the open-source library Jsoup, and regular expressions. In general, Jsoup can handle the job; pages it cannot parse are rare. Its powerful API makes parsing and extraction extremely simple, and the Zhihu crawler uses it.
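A small Jsoup example of the kind of extraction meant here; the HTML fragment and CSS selectors are invented for illustration, not Zhihu's real markup:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupDemo {
    public static void main(String[] args) {
        // A made-up fragment standing in for a downloaded question page
        String html = "<div class='question'>"
                + "<h2><a href='/question/12345'>Example question title</a></h2>"
                + "<span class='votes'>42</span></div>";

        Document doc = Jsoup.parse(html);
        // CSS selectors do the heavy lifting: no manual string slicing
        Element link = doc.selectFirst("div.question h2 a");
        System.out.println(link.text());                        // the question title
        System.out.println(link.attr("href"));                  // the relative URL
        System.out.println(doc.selectFirst(".votes").text());   // the vote count
    }
}
```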
(6) Regular expression matching and extraction (crawler technical point 5)
Although the Zhihu crawler uses Jsoup to parse web pages, it still encapsulates methods for regex matching and extraction, because regular expressions are useful for other tasks as well, such as filtering and validating URL addresses in the crawler.
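URL filtering with a regular expression might look like this; the pattern is a hypothetical whitelist for question pages, not the project's actual rule:

```java
import java.util.regex.Pattern;

public class UrlFilter {
    // Hypothetical whitelist: only follow links that look like question pages
    private static final Pattern QUESTION_URL =
            Pattern.compile("^https?://www\\.zhihu\\.com/question/\\d+$");

    public static boolean accept(String url) {
        return QUESTION_URL.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(accept("https://www.zhihu.com/question/12345"));  // true
        System.out.println(accept("https://www.zhihu.com/people/someone"));  // false
    }
}
```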
(7) Deduplication (crawler technical point 6)
For crawlers, the right deduplication scheme depends on the scale of the data: (1) for small data sets, tens or hundreds of thousands of URLs, a Map or Set is enough; (2) for medium data sets, millions to tens of millions, a BloomFilter works well; (3) for very large data sets, hundreds of millions to billions, Redis can be used. The Zhihu crawler ships a BloomFilter implementation but actually uses Redis for deduplication.
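As a sketch of option (2), here is a minimal Bloom filter in pure Java (not the project's implementation): it trades a small false-positive rate for constant memory, and never reports a false negative, so a URL it rejects has definitely not been seen.

```java
import java.util.BitSet;

public class UrlBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public UrlBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive the k bit positions from two base hashes (double hashing)
    private int index(String s, int i) {
        int h1 = s.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9e3779b9;  // simple second hash for the sketch
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(String s) {
        for (int i = 0; i < hashes; i++) bits.set(index(s, i));
    }

    // May return a false positive, never a false negative
    public boolean mightContain(String s) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(s, i))) return false;
        }
        return true;
    }
}
```

The Redis-based variant of option (3) applies the same idea at larger scale, typically with a Redis set (SADD returns 0 for an already-seen member) shared across crawler machines.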
(8) Advanced Java programming practices such as design patterns
In addition to the main crawler techniques above, the implementation of the Zhihu crawler also involves several design patterns, mainly the chain, singleton, and composite patterns, and it uses Java reflection. Beyond crawling itself, the project is also a good case study of design patterns and the Java reflection mechanism.
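As a toy illustration of two of those techniques (not the project's actual classes), the sketch below shows an eagerly initialized singleton and reflection-based loading of a parser by class name, the kind of plugin mechanism a crawler might drive from a config file; all names here are invented.

```java
public class PatternsDemo {
    // Singleton: one shared configuration object for the whole crawler
    public static class Config {
        private static final Config INSTANCE = new Config();
        private Config() {}
        public static Config getInstance() { return INSTANCE; }
        public int threadCount = 8;
    }

    // Reflection target: a pluggable parser interface
    public interface Parser {
        String parse(String html);
    }

    public static class TitleParser implements Parser {
        public String parse(String html) { return "parsed:" + html.length(); }
    }

    // Instantiate a parser by class name, e.g. one read from configuration
    public static Parser loadParser(String className) {
        try {
            return (Parser) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException("cannot load parser: " + className, e);
        }
    }

    public static void main(String[] args) {
        // Both calls return the same instance
        System.out.println(Config.getInstance() == Config.getInstance());  // true
        Parser p = loadParser(TitleParser.class.getName());
        System.out.println(p.parse("<p>hi</p>"));  // parsed:9
    }
}
```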
4. Some crawl results