An attempt to write a crawler

I. Background

I had an idea recently: if I could collect the news and article listings for a specified time period, I could run a simple public opinion analysis. The most basic step is obtaining the article list. There are some ready-made public opinion interfaces: Weibo's public opinion monitoring platform provides relatively mature APIs, and Alibaba Cloud and Baidu Cloud also offer public opinion interfaces. However, whether because of cost or because the time range of news an API covers does not meet expectations, none of them could be used directly. So I decided to crawl some information myself, temporarily, to support this piece of work.

II. Public opinion monitoring

Public opinion monitoring means collecting public opinion information by keyword across news sites, forums, blogs, Weibo, WeChat, Tieba, and so on. As an aside, JD Cloud's JD Wanxiang turns out to be a good API aggregation portal. Taking its public opinion APIs as an example, it covers a number of services:

Each service provider's capability is basically the same: collect news through self-crawling and interface partnerships, build out channel coverage, store everything locally for public opinion analysis, and expose the results externally. It sounds simple, but the retrieval and modeling parts are still hard.

III. Information sources

Back to the topic. The first step is to pick an appropriate data source from which to collect articles. Considering collection cost, going directly through the major search engines / traffic platforms is a good choice: as traffic portals, they have already aggregated resources from every channel for us.

On the other hand, every large traffic platform is a magnet for crawlers and knows the common crawling strategies like the back of its hand, so large-scale crawling is easily detected. Fortunately, we only fetch a small amount of information occasionally, purely for learning; this creates negligible traffic and generally attracts no attention. Knowing where the line is, and staying well inside it, really matters!

IV. Content analysis

4.1 search example

As it happens, Fujian has recently had another epidemic outbreak, so we will use that as the keyword to search:

Link corresponding to the result: https://www.baidu.com/s?wd=%E7%A6%8F%E5%BB%BA%20%E7%96%AB%E6%83%85&rsv_spt=1&rsv_iqid=0xff465a7d00029162&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&tn=baiduhome_pg&rsv_enter=1&rsv_dl=ib&rsv_sug3=28&rsv_sug1=19&rsv_sug7=101&rsv_sug2=0&rsv_btype=i&inputT=6747&rsv_sug4=11869
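Most of the rsv_* parameters appear to be client-side tracking state; the query itself is carried by wd. A quick check with the Python standard library that the encoded wd value matches the keyword '福建 疫情' ("Fujian epidemic"):

```python
from urllib.parse import quote, urlencode

keyword = '\u798f\u5efa \u75ab\u60c5'  # '福建 疫情'

# quote() percent-encodes exactly as in the URL above (the space becomes %20).
print(quote(keyword))  # %E7%A6%8F%E5%BB%BA%20%E7%96%AB%E6%83%85

# urlencode() builds a full query string, but note it encodes the space as '+'.
minimal = 'https://www.baidu.com/s?' + urlencode({'wd': keyword})
```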

4.2 content analysis of search results

Here we focus on analyzing the page structure to decide how to parse it.

Each search result consists of:

1. Title ("6+18" cases found in total; one article to understand the Fujian epidemic situation and transmission chain),

2. Release time (1 day ago),

3. Content abstract ([Case details] Putian City has found 6 confirmed cases and 18 asymptomatic infections. According to Putian CDC, as of 16:00 on the 11th, 6 confirmed cases and 18 asymptomatic infections have been found in this outbreak. Fujian has added 1 local confirmed case and 4 asymptomatic cases, all reported by Putian City.),

4. Source (Beijing News).

These are also the elements we want to collect.
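Before parsing, it helps to pin down the record we are after. A minimal sketch; the class and field names are my own choice, not from the page:

```python
from dataclasses import dataclass, astuple

@dataclass
class NewsItem:
  title: str   # headline of the result
  date: str    # relative release time, e.g. "1 day ago"
  source: str  # publishing outlet, e.g. "Beijing News"
  url: str     # Baidu redirect link to the article

item = NewsItem('Epidemic situation in Fujian', '1 day ago', 'Beijing News',
                'http://www.baidu.com/link?url=abc')
row = list(astuple(item))  # the 4-element row format used for CSV later
```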

4.3 page source code analysis

Viewing the page source, the block corresponding to the news item above looks like this:

	        data-click="{
			'F':'778717EA',
			'F1':'9D73F1E4',
			'F2':'4CA6DE6B',
			'F3':'54E5263F',
			'T':'1631514840',
						'y':'FFF5FF4D'
												}"
        href = "http://www.baidu.com/link?url=q7_vtPksHy_0aWRKZN8tfIsAl3bIFwiqMLAh1keliirFxhui2JPtcElwM4pvYz6_IOYVgaozLiQcXmFK4Gc9mT8i-dt3_6WdlXLeE10L0xO"

		            target="_blank"
        
		>Cumulative discovery of "6+18", one article to understand the <em>Fujian epidemic</em> situation and transmission chain</a></h3><div class="c-row c-gap-top-small"><div class="general_image_pic c-span3" style="position:relative;top:2px;"><a class="c-img c-img3 c-img-radius-large" style="height:85px"
          href="http://www.baidu.com/link?url=q7_vtPksHy_0aWRKZN8tfIsAl3bIFwiqMLAh1keliirFxhui2JPtcElwM4pvYz6_IOYVgaozLiQcXmFK4Gc9mT8i-dt3_6WdlXLeE10L0xO"
                target="_blank"
      ><img class="c-img c-img3 c-img-radius-large" src="https://t10.baidu.com/it/u=274264626152697636&fm=30&app=106&f=JPEG?w=312&h=208&s=31b6e832cf9241e9146191ef00005021" style="height:85px;"/><span class="c-img-border c-img-radius-large"></span></a></div><div class="c-span9 c-span-last"><div class="c-abstract"><span class="newTimeFactor_before_abs c-color-gray2 m">1 day ago&nbsp;</span>[<em>Case</em> details] Putian City has found 6 <em>confirmed</em> cases and 18 asymptomatic infections. According to Putian CDC, as of 16:00 on the 11th, 6 <em>confirmed</em> cases and 18 asymptomatic infections have been found in this <em>epidemic</em>.</div><style>.user-avatar{
	display: flex;
	flex-direction: row;
	align-items: center;
	justify-content: flex-start;
}</style><div class="f13 c-gap-top-xsmall se_st_footer user-avatar"><a target="_blank" href="http://www.baidu.com/link?url=q7_vtPksHy_0aWRKZN8tfIsAl3bIFwiqMLAh1keliirFxhui2JPtcElwM4pvYz6_IOYVgaozLiQcXmFK4Gc9mT8i-dt3_6WdlXLeE10L0xO" class="c-showurl c-color-gray" style="text-decoration:none;position:relative;"><div class="c-img c-img-circle c-gap-right-xsmall" style="display: inline-block;width: 16px;height: 16px;position: relative;top: 3px;vertical-align:top;"><span class="c-img-border c-img-source-border c-img-radius-large"></span><img src="https://pic.rmb.bdstatic.com/9da74a517eb1befeba93a5f3167cc74b.jpeg"></div><style>.nor-src-icon-v {display: inline-block;width: 10px;height: 10px;border-radius: 100%;position: absolute;left: 7px;bottom: -1px;background-image: url(https://b.bdstatic.com/searchbox/icms/searchbox/img/yellow-v.png);background-size: 10px 10px;}
				.nor-src-icon-v.vicon-1 {background-image: url(https://b.bdstatic.com/searchbox/icms/searchbox/img/red-v.png);}
				.nor-src-icon-v.vicon-2 {background-image: url(https://b.bdstatic.com/searchbox/icms/searchbox/img/blue-v.png);}
				.nor-src-icon-v.vicon-3 {background-image: url(https://b.bdstatic.com/searchbox/icms/searchbox/img/yellow-v.png);}</style><span class="nor-src-icon-v vicon-2"></span>Beijing News</span></a><div class="c-tools c-gap-left" id="tools_11222397331129245369_10" data-tools='{"title": "Cumulative discovery of 6+18, one article to understand the Fujian epidemic situation and transmission chain", "url": "http://www.baidu.com/link?url=q7_vtPksHy_0aWRKZN8tfIsAl3bIFwiqMLAh1keliirFxhui2JPtcElwM4pvYz6_IOYVgaozLiQcXmFK4Gc9mT8i-dt3_6WdlXLeE10L0xO"}'><i class="c-icon f13" ></i></div><span class="c-icons-outer"><span class="c-icons-inner"></span></span><style>.snapshoot, .snapshoot:visited {
        color: #9195A3!important;
    }
    .snapshoot:active, .snapshoot:hover {
        color: #626675!important;
    }</style><a data-click="{'rsv_snapshot':'1'}" href="http://cache.baiducontent.com/c?m=3KZKMrJKhK1otETy33lbReLfsu2eWoaOfZYd2MNWUY3xbKNxMJNNwt_CsOHD6e7Qgf2Tu0GsMgHlSCO1_urn2_JjqYPGu6wMAk4gekije3KTYWOhsyDxmgTxtXJYJJMOij3XKVONqycqZ7hhjK7jNCazerlWVPWh5X0RiBQJlrbX68JcbomnhpiL-nT2Mc-T&p=882a930085cc43fd1cb9d1284e&newp=8b2a970d86cc47f719a28a285f53d836410eed643ac3864e1290c408d23f061d4863e1b923271101d5ce7f6606af4359e1f2337323454df6cc8a871d81edda&s=cfcd208495d565ef&user=baidu&fm=sc&query=%B8%A3%BD%A8+%D2%DF%C7%E9&qid=b3c763fd0003c0d6&p1=10"
                        target="_blank"
                    class="m c-gap-left c-color-gray kuaizhao snapshoot">Baidu snapshot</a></div></div></div></div>

The next step is to identify the literal text immediately before and after each element, and extract the required content with regular expressions anchored on those prefixes and suffixes.
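The "front and back marks" idea is simply a regular expression with literal prefix and suffix text around a capture group. A toy sketch on an abridged snippet (the class names mirror the real page, but the HTML is simplified):

```python
import re

snippet = ('<span class="newTimeFactor_before_abs c-color-gray2 m">1 day ago&nbsp;</span>'
           '<span class="c-showurl">Beijing News</span><div class="c-tools">')

# Everything between the literal prefix and the literal suffix is captured.
date = re.findall(r' m">(.+?)&nbsp;</span>', snippet)
source = re.findall(r'class="c-showurl">(.+?)</span><div', snippet)
print(date, source)  # ['1 day ago'] ['Beijing News']
```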

V. Crawler source code

5.1 user_agents settings

import csv
import random
import re
import time
import urllib.parse
import urllib.request

# Rotate through several User-Agent strings so consecutive requests
# look less uniform and are less likely to be blocked.
user_agents = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20130406 Firefox/23.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533+ (KHTML, like Gecko) Element Browser 5.0',
    'IBM WebExplorer /v0.94',
    'Galaxy/1.0 [en] (Mac OS X 10.5.6; U; en)',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
    'Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14',
    'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25',
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1468.0 Safari/537.36',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; TheWorld)',
]

5.2 file content of search results (html)

One thing to note: pn is the pagination offset of the search results. Each page holds 10 results, so the values passed in must be 0, 10, 20, and so on.
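In other words, with 10 results per page, the offset is just page * 10:

```python
# pn offsets for the first six result pages, 10 results per page.
offsets = [page * 10 for page in range(6)]
print(offsets)  # [0, 10, 20, 30, 40, 50]
```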

def baidu_search(keyword, pn):
  # Build the query URL; rn=100 asks for up to 100 results per page.
  params = urllib.parse.urlencode({'wd': keyword})
  url = 'http://www.baidu.com/s?' + params + '&pn={0}&cl=3&rn=100'.format(pn)
  # Send a randomly chosen User-Agent with each request.
  req = urllib.request.Request(url, headers={'User-Agent': random.choice(user_agents)})
  with urllib.request.urlopen(req) as res:
    return res.read().decode('utf-8', 'ignore')

5.3 Utility methods

def getList(regex, text):
  # re.findall already returns a list of all matches.
  return re.findall(regex, text)

def getMatch(regex, text):
  # Return the first match, or an empty string if there is none.
  res = re.findall(regex, text)
  return res[0] if res else ""

def clearTag(text):
  # Strip anything that looks like an HTML tag.
  return re.sub(r'<[^>]+>', "", text)

def write2File(path, content):
  # Write a single CSV row; the file is closed on leaving the with-block.
  with open(path, 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerow(content)
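A quick sanity check of the helpers' behavior (clearTag is restated here so the snippet runs on its own):

```python
import re

def clearTag(text):
  # Strip anything that looks like an HTML tag.
  return re.sub(r'<[^>]+>', '', text)

print(clearTag('<em>Fujian</em> epidemic'))  # Fujian epidemic
```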

5.4 parsing rules and writing results to files

def geturl(keyword):
  newsList = []
  f = open('result-' + keyword + '.csv', 'w', newline='', encoding='utf-8')
  writer = csv.writer(f)

  for page in range(60):
    pn = page * 10
    content = baidu_search(keyword, pn)
    print(pn)
    # write2File('search results - ' + str(pn) + '.html', content)
    time.sleep(2)

    # First pass: results carrying a "Baidu snapshot" footer.
    itemList = content.split('content_left')
    for item in itemList:
      if item.find('class=\"m c-gap-left c-color-gray') >= 0:
        linkList = re.findall(r"newTimeFactor_before_abs(.+)font", item)
        for link in linkList:
          news_date = re.findall(r"span style=\"color: #9195A3;\">(.+)<\/span>&nbsp;-&nbsp;", link)
          title = re.findall(r"\"title\":\"(.+)\",\"url", link)
          url = re.findall(r"url\":\"(.+)\"}\'>", link)
          source = re.findall(r"class=\"c-showurl\">(.+)<\/span><div", link)
          # Only keep entries for which every field was matched.
          if title and news_date and source and url:
            row = [title[0], news_date[0], source[0], url[0]]
            newsList.append(row)
            print(title[0])
            writer.writerow(row)

    # Second pass: results that include a thumbnail image.
    itemList = content.split('newTimeFactor_before_abs c-color-gray2')
    for i, link in enumerate(itemList):
      if i == 0:
        continue  # the text before the first marker is not a result
      if link.find('general_image_pic') >= 0:
        news_date = re.findall(r" m\">(.+)&nbsp;</span>", link)
        source = re.findall(r".jpeg\"><\/div>(.+)<\/span><\/a><div ", link)
        if not source:
          source = re.findall(r"<span class=\"nor-src-icon-v vicon-2\"><\/span>(.+)<\/span><\/a><div", link)
        title = re.findall(r"\"title\":\"(.+)\",\"url", link)
        url = re.findall(r"url\":\"(.+)\"}\'>", link)
        if title and news_date and source and url:
          row = [title[0], news_date[0], source[0], url[0]]
          newsList.append(row)
          print(title[0])
          writer.writerow(row)

  f.close()
  time.sleep(5)
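Since each row written above is [title, date, source, url], reading a result file back is straightforward with the csv module; the round-trip below uses a temporary file rather than a real result:

```python
import csv
import os
import tempfile

def read_results(path):
  # Rows come back in the same [title, date, source, url] order they were written.
  with open(path, newline='', encoding='utf-8') as f:
    return list(csv.reader(f))

# Round-trip demo with a temporary file.
path = os.path.join(tempfile.mkdtemp(), 'result-demo.csv')
with open(path, 'w', newline='', encoding='utf-8') as f:
  csv.writer(f).writerow(['A title', '1 day ago', 'Beijing News', 'http://example.com'])
print(read_results(path))  # [['A title', '1 day ago', 'Beijing News', 'http://example.com']]
```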

5.5 Execution entry point

if __name__ == '__main__':
  # Replace these placeholders with the keywords you want to search for.
  arr = ['keyword 1', 'keyword 2']
  for keyword in arr:
    geturl(keyword)

VI. Summary

With that, a tentative spider is finished. If you want more of the code or would like to keep up a technical exchange, you can follow my WeChat official account.

Tags: database, big data, crawler, architecture

Posted on Tue, 14 Sep 2021 00:32:35 -0400 by raker7