What is Python crawler (web scraping) technology good at, and what do you need to learn?

Basic crawler skills:

(1) Basic libraries: the urllib module and the third-party requests module. First of all, a crawler has to fetch the pages that contain the information we need, so we must learn urllib or requests, the modules responsible for downloading web pages. Either one works; pick whichever you are more comfortable with. I recommend the requests module, because it is much simpler to use and easier to understand, which is why it is known as the "HTTP for Humans" library. (A minimal fetch sketch is given below.)

(2) Multi-processing, multi-threading, coroutines, and distributed crawling: why learn these four topics? Suppose you want to grab 2 million records. With an ordinary single process or single thread it could take a week or more to download them all, which is obviously not the result you want. Single-process, single-thread crawling does not meet the need for efficiency and wastes time. By running multiple processes or threads, the crawling speed can often be raised tenfold or even more. (See the thread-pool sketch below.)

(3) Page parsing and extraction libraries: XPath, BeautifulSoup4, and regular expressions. Steps (1) and (2) only fetch the raw page source, which contains a lot of content we do not want, so we need to filter out the useless parts and keep the valuable information. Each of the three parsers has its own strengths and weaknesses in different scenarios; in general it is most convenient to learn to use them flexibly. For readers who know little about crawlers or are just getting started, I recommend BeautifulSoup4: it is easy to master, quick to apply in practice, and still very powerful. (A parsing sketch follows below.)

(4) Anti-blocking: sometimes requests fail while crawling because the target site has anti-crawler measures in place, and you have to adjust the request headers, use a proxy server, or supply cookies. In that case we disguise our behavior so the site does not notice that we are a crawler: request headers are set mainly to imitate a real browser; if your IP is blocked, you work around it with a proxy server; and cookies are used to simulate a logged-in session so you can enter the site. (A disguise sketch is shown at the end.)
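For point (1), here is a minimal sketch of fetching a page with requests, under the assumption of a placeholder URL (https://example.com is not from the original text):

import requests

url = "https://example.com"                      # hypothetical target page
response = requests.get(url, timeout=10)
response.raise_for_status()                      # raise an error on 4xx/5xx responses
response.encoding = response.apparent_encoding   # let requests guess the page encoding
html = response.text                             # the page source as a string
print(html[:200])                                # preview the first 200 characters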
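For point (2), a minimal sketch of speeding up downloads with a thread pool from the standard library's concurrent.futures; the URL list and the fetch helper are hypothetical stand-ins for real crawl targets:

import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

urls = [f"https://example.com/page/{i}" for i in range(1, 21)]   # hypothetical pages

def fetch(url):
    # Download one page and return its size as a stand-in for real processing.
    resp = requests.get(url, timeout=10)
    return url, len(resp.text)

with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(fetch, u) for u in urls]
    for future in as_completed(futures):
        url, size = future.result()
        print(url, size)

Coroutines (for example with asyncio and aiohttp) and distributed crawling follow the same idea of doing many downloads at once, just with different machinery.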
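For point (3), a minimal sketch of extracting data with BeautifulSoup4 (installed as beautifulsoup4); the HTML snippet and tag names are purely illustrative:

from bs4 import BeautifulSoup

html = """
<html><body>
  <h1 class="title">Example page</h1>
  <a href="/item/1">First item</a>
  <a href="/item/2">Second item</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.find("h1", class_="title").get_text())   # prints "Example page"
for link in soup.find_all("a"):
    print(link["href"], link.get_text())             # each link's href and text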
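For point (4), a minimal sketch of disguising a request with a browser-like User-Agent, a proxy server, and cookies; the proxy address and cookie value are placeholders, not real credentials:

import requests

headers = {
    # Pretend to be an ordinary desktop browser instead of python-requests
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
}
proxies = {
    "http": "http://127.0.0.1:8888",    # hypothetical proxy server
    "https": "http://127.0.0.1:8888",
}
cookies = {"sessionid": "xxxx"}          # copied from a logged-in browser session

resp = requests.get("https://example.com", headers=headers,
                    proxies=proxies, cookies=cookies, timeout=10)
print(resp.status_code)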