I. Approach analysis
At present, there are two ways to grab the articles:
1. Grab the article link directly inside the WeChat app (links of the form /s?__biz=MjM5MzU4ODk2MA==&mid=2735446906&idx=1&sn=ece37deaba0c8ebb9badf07e5a3bd3&scene=0#rd).
2. Send the corresponding request through Sogou search (weixin.sogou.com), WeChat's search partner, and grab the articles indirectly.
With method 1, such links are hard to obtain and their pattern is not particularly clear.
This article therefore adopts method 2: send requests to weixin.sogou.com, parse the returned data in real time, and save it locally.
II. Crawling process
1. First, experiment with Sogou's WeChat search page; this makes the whole flow clearer.
Search for the official account on the engine by its English name. (Among official accounts the English name is unique, while Chinese names may be duplicated; the name must be exactly right, otherwise many irrelevant results may come back. Using the unique English name reduces the data-screening work: just pick the result matching that name.) That is, send a request to 'http://weixin.sogou.com/weixin?type=1&query=%s&ie=utf8&_sug_=n&_sug_type_=' % 'python' and parse the homepage link of the matching official account out of the result page.
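Building that search URL just means URL-encoding the account name into the query string. A minimal Python 3 sketch (the original post uses Python 2's urllib.quote; the host is taken from the weixin.sogou.com mentioned above, and the scheme is assumed to be http):

```python
from urllib.parse import quote

def build_sogou_search_url(account_name):
    # type=1 searches official accounts; query holds the URL-encoded account name
    return ('http://weixin.sogou.com/weixin?type=1&query=%s'
            '&ie=utf8&_sug_=n&_sug_type_=' % quote(account_name))

print(build_sogou_search_url('python'))
```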
2. Get the homepage entry content
requests, urllib/urllib2, or webdriver+PhantomJS can all be used. Here requests.get() is used to fetch the content of the entry page.
```python
# fake browser headers for the crawler
self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0'}
# request timeout in seconds
self.timeout = 5
# the crawl session is simulated with requests.Session
self.s = requests.Session()
```
```python
# query the search entry page, using the official account name as the keyword
def get_search_result_by_keywords(self):
    self.log('search url: %s' % self.sogou_search_url)
    return self.s.get(self.sogou_search_url, headers=self.headers, timeout=self.timeout).content
```
3. Get the official account's homepage address
There are many ways to extract the homepage address from the fetched page content: BeautifulSoup, webdriver, plain regular expressions, pyquery, and so on.
Here pyquery is used to find the entry address of the account's homepage.
```python
# get the official account's homepage address
def get_wx_url_by_sougou_search_html(self, sougou_search_html):
    doc = pq(sougou_search_html)
    # print doc('p[class="tit"]')('a').attr('href')
    # print doc('div[class=img-box]')('a').attr('href')
    # pyquery works much like BeautifulSoup, but its API mirrors jQuery;
    # find the homepage address of the official account
    return doc('div[class=txt-box]')('p[class=tit]')('a').attr('href')
```
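The same extraction can be mimicked without pyquery, for instance with a regular expression over the result markup. A sketch in Python 3, where the html fragment is fabricated for illustration (only the txt-box/tit class names follow the selector used above):

```python
import re

sample = '''
<div class="txt-box">
  <p class="tit"><a href="http://mp.weixin.qq.com/profile?src=3&amp;biz=example">Python</a></p>
</div>
'''

def extract_wx_url(html):
    # find the first <a href=...> inside p.tit within div.txt-box
    m = re.search(r'class="txt-box".*?class="tit">\s*<a\s+href="([^"]+)"', html, re.S)
    return m.group(1) if m else None

print(extract_wx_url(sample))
```

A CSS-selector library like pyquery is more robust than a regex against markup changes, which is why the post uses it.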
4. Get the list of articles on the account's homepage
First, the homepage needs to be fully loaded. PhantomJS+webdriver is used here because the page content is rendered and loaded by js; the previous approach can only fetch the static html.
```python
# use webdriver to load the official account's homepage, mainly the js-rendered part
def get_selenium_js_html(self, url):
    browser = webdriver.PhantomJS()
    browser.get(url)
    time.sleep(3)
    # execute js to get the whole page content
    html = browser.execute_script("return document.documentElement.outerHTML")
    return html
```
After getting the homepage content, extract the article list we need from it.
```python
# get the article entries of the official account
def parse_wx_articles_by_html(self, selenium_html):
    doc = pq(selenium_html)
    print 'start looking for article entries'
    return doc('div[class="weui_media_box appmsg"]')
    # some accounts have only 10 articles, some may have more
    # return doc('div[class="weui_msg_card"]')  # for accounts with only 10 articles
```
5. Parse each article entry to get the information we need.
6. Process the corresponding content: article title, address, summary, publication date, and so on.
7. Save the article content: write it locally as html, and at the same time save the fields from the previous step in excel format.
8. Save the json data.
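Steps 6 to 8 boil down to collecting each article's fields into a dict and serializing the list. A minimal Python 3 sketch (the field names follow the parse_one_article method in the full source; the sample data and file path are invented):

```python
import json
import os
import tempfile

def save_articles_json(articles_list, out_path):
    # serialize the list of article dicts to a json file
    with open(out_path, 'w', encoding='utf-8') as f:
        json.dump(articles_list, f, ensure_ascii=False, indent=2)

articles = [{'title': 'Hello', 'url': 'http://example.com', 'summary': 'demo',
             'date': '2017-01-01', 'pic': '', 'content': '<p>hi</p>'}]
out = os.path.join(tempfile.gettempdir(), 'articles.json')
save_articles_json(articles, out)
print(json.load(open(out, encoding='utf-8'))[0]['title'])
```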
With the work split into these steps, grabbing the articles of an official account is no longer particularly difficult.
III. Source code
The first version of the source code:
```python
#!/usr/bin/python
# coding: utf-8
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

from urllib import quote
from pyquery import PyQuery as pq
from selenium import webdriver
import requests
import time
import re
import json
import os


class weixin_spider:

    def __init__(self, kw):
        ''' constructor '''
        self.kw = kw
        # Sogou WeChat search url
        self.sogou_search_url = 'http://weixin.sogou.com/weixin?type=1&query=%s&ie=utf8&_sug_=n&_sug_type_=' % quote(self.kw)
        # fake browser headers for the crawler
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0 FirePHP/0.7.4.1'}
        # request timeout in seconds
        self.timeout = 5
        self.s = requests.Session()

    def get_search_result_by_kw(self):
        self.log('search url: %s' % self.sogou_search_url)
        return self.s.get(self.sogou_search_url, headers=self.headers, timeout=self.timeout).content

    def get_wx_url_by_sougou_search_html(self, sougou_search_html):
        ''' extract the official account's homepage link from sougou_search_html '''
        doc = pq(sougou_search_html)
        # print doc('p[class="tit"]')('a').attr('href')
        # print doc('div[class=img-box]')('a').attr('href')
        # pyquery works much like BeautifulSoup, but its API mirrors jQuery;
        # find the homepage address of the official account
        return doc('div[class=txt-box]')('p[class=tit]')('a').attr('href')

    def get_selenium_js_html(self, wx_url):
        ''' execute the js rendering and return the rendered html '''
        browser = webdriver.PhantomJS()
        browser.get(wx_url)
        time.sleep(3)
        # execute js to get the whole dom
        html = browser.execute_script("return document.documentElement.outerHTML")
        return html

    def parse_wx_articles_by_html(self, selenium_html):
        ''' parse the official account's articles out of selenium_html '''
        doc = pq(selenium_html)
        return doc('div[class="weui_msg_card"]')

    def switch_arcticles_to_list(self, articles):
        ''' turn the articles into a list of dicts '''
        articles_list = []
        i = 1
        if articles:
            for article in articles.items():
                self.log(u'collecting (%d/%d)' % (i, len(articles)))
                articles_list.append(self.parse_one_article(article))
                i += 1
                # break
        return articles_list

    def parse_one_article(self, article):
        ''' parse one article '''
        article = article('.weui_media_box[id]')
        title = article('h4[class="weui_media_title"]').text()
        self.log('title: "%s"' % title)
        # note: the base url prefix was elided in the source post
        url = '' + article('h4[class="weui_media_title"]').attr('hrefs')
        self.log('url: "%s"' % url)
        summary = article('.weui_media_desc').text()
        self.log('summary: "%s"' % summary)
        date = article('.weui_media_extra_info').text()
        self.log('date: "%s"' % date)
        pic = self.parse_cover_pic(article)
        content = self.parse_content_by_url(url).html()
        contentfiletitle = self.kw + '/' + title + '_' + date + '.html'
        self.save_content_file(contentfiletitle, content)
        return {
            'title': title,
            'url': url,
            'summary': summary,
            'date': date,
            'pic': pic,
            'content': content
        }

    def parse_cover_pic(self, article):
        ''' parse the article's cover picture '''
        pic = article('.weui_media_hd').attr('style')
        p = re.compile(r'background-image:url\((.*?)\)')
        rs = p.findall(pic)
        self.log('cover pic: %s' % (rs[0] if len(rs) > 0 else ''))
        return rs[0] if len(rs) > 0 else ''

    def parse_content_by_url(self, url):
        ''' fetch the article's detail page '''
        page_html = self.get_selenium_js_html(url)
        return pq(page_html)('#js_content')

    def save_content_file(self, title, content):
        ''' write the page content to a file '''
        with open(title, 'w') as f:
            f.write(content)

    def save_file(self, content):
        ''' write data to a file '''
        with open(self.kw + '/' + self.kw + '.txt', 'w') as f:
            f.write(content)

    def log(self, msg):
        ''' custom log function '''
        print u'%s: %s' % (time.strftime('%Y-%m-%d %H:%M:%S'), msg)

    def need_verify(self, selenium_html):
        ''' Sogou sometimes blocks the ip and redirects to a verification page;
            check whether the html contains a tag with id=verify_change --
            if so, we were redirected and should retry later '''
        return pq(selenium_html)('#verify_change').text() != ''

    def create_dir(self):
        ''' create the output folder '''
        if not os.path.exists(self.kw):
            os.makedirs(self.kw)

    def run(self):
        ''' crawler entry point '''
        # Step 0: create a folder named after the official account
        self.create_dir()
        # Step 1: query the sogou WeChat engine with the account's English name as keyword
        self.log(u'start crawling, account English name: %s' % self.kw)
        self.log(u'calling the sogou search engine')
        sougou_search_html = self.get_search_result_by_kw()
        # Step 2: parse the account's homepage link from the search result page
        self.log(u'got sougou_search_html, extracting the homepage wx_url of the account')
        wx_url = self.get_wx_url_by_sougou_search_html(sougou_search_html)
        self.log(u'got wx_url: %s' % wx_url)
        # Step 3: use Selenium+PhantomJS to get the html rendered by asynchronous js
        self.log(u'calling selenium to render the html')
        selenium_html = self.get_selenium_js_html(wx_url)
        # Step 4: check whether the target site has blocked us
        if self.need_verify(selenium_html):
            self.log(u'the crawler was blocked by the target site, please retry later')
        else:
            # Step 5: use pyquery to parse the article list out of the html from step 3
            self.log(u'html rendered, parsing the account articles')
            articles = self.parse_wx_articles_by_html(selenium_html)
            self.log(u'%d articles found' % len(articles))
            # Step 6: pack the article data into a list of dicts
            self.log(u'packing the article data into dicts')
            articles_list = self.switch_arcticles_to_list(articles)
            # Step 7: convert the list from step 6 into json
            self.log(u'packing done, converting to json')
            data_json = json.dumps(articles_list)
            # Step 8: write to file
            self.log(u'converted to json, saving the json data to file')
            self.save_file(data_json)
            self.log(u'saved, all done')


# main
if __name__ == '__main__':
    gongzhonghao = raw_input(u'official account to crawl: ')
    if not gongzhonghao:
        gongzhonghao = 'python6359'
    weixin_spider(gongzhonghao).run()
```
The second version of the code:
Compared with the first version, the code was optimized and adjusted, mainly:
1. Added excel storage.
2. Modified the rules for extracting the article content.
3. Enriched the comments.
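The excel storage added in this version can be sketched with the standard csv module (csv files open fine in Excel); the library the second version actually uses is not shown here, so this is only a stand-in sketch, with invented sample data and output path:

```python
import csv
import os
import tempfile

def save_articles_csv(articles_list, out_path):
    # write one row per article; columns mirror the json fields (content is skipped)
    fields = ['title', 'url', 'summary', 'date', 'pic']
    with open(out_path, 'w', newline='', encoding='utf-8-sig') as f:
        w = csv.DictWriter(f, fieldnames=fields, extrasaction='ignore')
        w.writeheader()
        w.writerows(articles_list)

rows = [{'title': 'A', 'url': 'u', 'summary': 's', 'date': 'd', 'pic': '', 'content': 'x'}]
path = os.path.join(tempfile.gettempdir(), 'articles.csv')
save_articles_csv(rows, path)
print(open(path, encoding='utf-8-sig').read().splitlines()[0])
```

The utf-8-sig encoding writes a BOM so that Excel detects the encoding correctly.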
One known flaw: if an article of the official account contains a video, the program may raise an error.
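One way to contain such per-article failures is to wrap the parsing of each entry in a try/except, so that one bad article (e.g. a video post) does not abort the whole crawl. A sketch under that assumption, with a hypothetical demo_parser standing in for the real parse_one_article:

```python
def parse_all(articles, parse_one, log=print):
    # collect successfully parsed articles; skip (and log) any entry that raises
    results, failed = [], 0
    for art in articles:
        try:
            results.append(parse_one(art))
        except Exception as e:
            failed += 1
            log('skipping article: %s' % e)
    return results, failed

def demo_parser(a):
    # hypothetical parser: fails on video entries, as the flaw above describes
    if a.get('video'):
        raise ValueError('video entry not supported')
    return a['title']

items = [{'title': 'A'}, {'title': 'B', 'video': True}, {'title': 'C'}]
print(parse_all(items, demo_parser))
```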
```python
#!/usr/bin/python
# coding: utf-8
```