The Third Major Assignment of Data Acquisition

Assignment 1

1.1 experimental topic

  • Requirements: specify a website and crawl all of its pictures, for example the China Weather Network (http://www.weather.com.cn). Use single-threaded and multi-threaded crawling respectively. (The number of pictures to crawl is limited to the last three digits of the student ID.)

  • Output information:

    Output the URL of each downloaded picture on the console, store the downloaded pictures in the images subfolder, and provide a screenshot.

1.2 single-threaded coding approach

1.2.1 get all the links on the weather network home page

To crawl enough pictures we need some way to turn pages. At first we considered paging through the weather pages of the different regions, but obtaining each region's code just to turn pages is tedious, and many of the same pictures appear repeatedly.

So in the end we collect links to the other pages from the home page and extract pictures from those pages.

from lxml import etree

start_url = "http://www.weather.com.cn/"
data = getHTMLText(start_url)      # getHTMLText fetches the page text (helper from the class code)
html = etree.HTML(data)
# Find the href attribute of every <a> tag on the home page
urlli = html.xpath("//a/@href")
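getHTMLText is the page-fetching helper referenced above but not shown in this write-up; a minimal sketch of what it presumably looks like (an assumption, not the exact class code):

import requests

def getHTMLText(url):
    # Fetch a page and return its text, or "" on failure.
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except Exception as err:
        print(err)
        return ""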

Because urlli also contains href values that are not links to other pages (entries that do not start with http://), it is filtered with the following code:

urls = []
for url in urlli:
    # Keep only absolute links that start with http://
    if url[0:7] == "http://":
        urls.append(url)

1.2.2 get picture links from each page

Extract the picture links from each page and drop duplicates. Every time a new picture link appears, call the download method to save that picture.

images = soup.select("img")
for image in images:
    try:
        src = image["src"]
        url = urllib.request.urljoin(start_url, src)
        if url not in urls:
            count += 1
            if count > 15:      # stop once the picture limit is reached
                break
            urls.append(url)
            print(url)
            download(url, count)
    except Exception as err:
        print(err)

1.2.3 picture download

file_name = url.split('/')[-1]
rep = requests.get(url)
with open("images\\" + file_name, 'wb') as f:
    f.write(rep.content)        # the with block closes the file automatically
print("downloaded " + file_name)

1.2.4 results

The console outputs each picture link and a message confirming the download.

Results in the local images folder.

1.3 multi-threaded coding approach

1.3.1 differences from the single-threaded version

The main difference is that each download now runs in its own thread, so threads = [] is used to keep track of the Thread objects. The code that starts a download thread is as follows:

T = threading.Thread(target=download, args=(url,))
T.setDaemon(False)          # non-daemon thread: the program waits for it to finish
T.start()
threads.append(T)

Everything else is consistent with the single-threaded version.
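Presumably the main thread also waits for all download threads to finish before exiting; a minimal sketch using the threads list above:

# Wait for every download thread to finish.
for t in threads:
    t.join()
print("all download threads finished")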

1.3.2 results

The console outputs each picture link and a message confirming the download.

Results in the local images folder.

1.4 complete code

https://gitee.com/q_kj/crawl_project/tree/master/third

1.5 summary

The main problem at the start of this assignment was how to turn pages. Paging through the regional weather pages was too complex, so in the end I extracted the other page links from the main page and crawled those pages. The remaining parts could be adapted from the code given in class.

Assignment 2

2.1 experimental topic

  • Requirements: use the scrapy framework to reproduce Assignment 1.

  • Output information:

    Same as Assignment 1.

2.2 ideas

2.2.1 items.py

class WeatherItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    imag = scrapy.Field()

imag is the field that stores the picture link.

2.2.2 MySpider.py

The parse function is mainly used to collect the links to all the other pages reachable from the main page.

urls = []
# First find the href attribute of every <a> tag on the home page
urllib = response.xpath("//a/@href").extract()
for url in urllib:
    # Some href values are not full links, so filter them and keep the rest in urls
    if url[0:7] == "http://":
        if url not in urls:
            urls.append(url)
# Call the image crawling function for each collected link
for url in urls:
    if self.count < 135:
        yield scrapy.Request(url=url, callback=self.imageSpider)

The imageSpider function is mainly used to collect the links to all the pictures on a page.

imgurl = []
# Extract the picture link from the src attribute of each <img> tag
img_url = response.xpath("//img//@src").extract()
for url in img_url:
    # Not every src value is a full link, so filter them
    if url[0:7] == "http://":
        # Exclude duplicate links
        if url not in imgurl:
            self.count += 1
            # Only keep adding pictures while the count is below 135
            if self.count < 135:
                imgurl.append(url)
for url in imgurl:
    item = WeatherItem()
    item["imag"] = url
    yield item
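For context, the two snippets above live inside the spider class. A rough skeleton is shown below; the class name, spider name, and import path are assumptions based on the file names, not code taken from the repository:

import scrapy
from weather.items import WeatherItem   # assumed project module name

class MySpider(scrapy.Spider):
    name = "weather"                     # assumed spider name
    start_urls = ["http://www.weather.com.cn/"]
    count = 0                            # counts unique picture links found

    def parse(self, response):
        # Collect page links from the home page (first snippet above),
        # then yield a Request per page with callback=self.imageSpider.
        ...

    def imageSpider(self, response):
        # Collect picture links on one page and yield a WeatherItem per link
        # (second snippet above).
        ...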

2.2.3 pipelines.py

open_spider, which runs when the spider is opened:

    def open_spider(self,spider):
        print("opened")
        self.opened=True
        self.count=0

close_spider, which runs when the spider is closed:

    def close_spider(self,spider):
        if self.opened:
            self.opened=False
            print("closed")
            print("Total crawled", self.count, "pictures")

The two methods above are adapted from the code in the PPT given by the teacher.

process_item downloads each picture:

self.count = self.count + 1
img = item["imag"]
print(img)
if self.opened:
    try:
        # Download the picture to the local images folder
        rep = requests.get(img)
        file_name = img.split('/')[-1]
        with open("images\\" + file_name, 'wb') as f:
            f.write(rep.content)
        print("downloaded " + file_name)
    except Exception as err:
        print(err)
return item      # pass the item on, as pipelines normally do

2.2.4 results

​ Console information:

​ Results in the images folder:

2.3 complete code

https://gitee.com/q_kj/crawl_project/tree/master/weather

2.4 summary

The second assignment is very similar to the first; the main work lies in writing the various files required by the scrapy framework.

Assignment 3

3.1 experimental topic

  • Requirements: crawl the Douban movie data using scrapy and xpath, store the content in a database, and store the pictures under the imgs path.

  • Candidate sites: https://movie.douban.com/top250

  • Output information:

    Serial number | Movie title              | Director       | Leading actors | Brief introduction    | Rating | Cover
    1             | The Shawshank Redemption | Frank Darabont | Tim Robbins    | Hope sets people free | 9.7    | ./imgs/xsk.jpg
    2             | ...

3.2 ideas

3.2.1 items.py

class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    rank = scrapy.Field()
    movie_name = scrapy.Field()
    director = scrapy.Field()
    actors = scrapy.Field()
    brief_introduction = scrapy.Field()
    grade = scrapy.Field()
    imag = scrapy.Field()

3.2.2 spider.py

​ Extract the contents of each part through xpath:

rank = response.xpath("//div[@class='pic']/em/text()").extract() # ranking
name = response.xpath("//div[@class='hd']/a/span[1]/text()").extract() # movie name
dir_acts = response.xpath("//div[@class='bd']/p[1]/text()").extract() # because the director and actor information are mixed together, it needs to be extracted again
brief_intro = response.xpath("//div[@class='bd']/p[2]/span/text()").extract() # introduction
grade = response.xpath("//div[@class='bd']/div[@class='star']/span[@class='rating_num']/text()").extract() # score
img = response.xpath("//div[@class='pic']/a/img/@src").extract() # movie cover picture link

The problem is that the director and lead-actor information extracted above comes back mixed together in a single string per movie, along with other text such as the year and region (the figure showing the raw string is not reproduced here).

Therefore, it needs a second round of parsing:

dname = []  # stores the director names
aname = []  # stores the lead-actor names
count = 0   # tracks the parity of entries in dir_acts
# Each movie contributes two entries to dir_acts: the first holds the director and
# lead actors, the second holds the year / region / genre.
# So only the first (odd-numbered) entry of each pair is parsed.
for dir_act in dir_acts:
    count += 1
    if count % 2 != 0:
        # The director part and the actor part are separated by "\xa0\xa0\xa0"
        dir_act = dir_act.strip().split('\xa0\xa0\xa0')
        # Strip the Chinese "导演:" (director) label and store the name in dname
        dname.append(dir_act[0].split('导演:')[1])
        # Strip the Chinese "主演:" (starring) label and store the names in aname
        aname.append(dir_act[1].split('主演:')[1])
    else:
        continue
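Since the figure showing the raw string is not reproduced, here is a small demonstration of the splitting logic on a sample string; the sample text is an assumption about the format of the Douban page, not data captured from the crawl:

# A sample first entry in the format Douban uses (the exact text is an assumption):
sample = "导演: 弗兰克·德拉邦特 Frank Darabont\xa0\xa0\xa0主演: 蒂姆·罗宾斯 Tim Robbins /..."
parts = sample.strip().split('\xa0\xa0\xa0')
print(parts[0].split('导演:')[1].strip())   # -> 弗兰克·德拉邦特 Frank Darabont
print(parts[1].split('主演:')[1].strip())   # -> 蒂姆·罗宾斯 Tim Robbins /...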

Then save the final results into items, looping over the extracted lists:

for i in range(len(rank)):              # one item per movie on the page
    item = DoubanItem()
    item["rank"] = int(rank[i])
    item["movie_name"] = name[i]
    item["director"] = dname[i]
    item["actors"] = aname[i]
    item["brief_introduction"] = brief_intro[i]
    item["grade"] = grade[i]
    item["imag"] = img[i]
    yield item

​ Finally, turn the page:

# Pagination: https://movie.douban.com/top250?start=<n>&filter= where <n> is a multiple of 25
url_f = "https://movie.douban.com/top250"
for i in range(1, 10):
    url = url_f + "?start=" + str(i * 25) + "&filter="
    # Request the next page and call parse again on it
    yield scrapy.Request(url=url, callback=self.parse)

3.2.3 pipelines.py

open_spider: opens the spider and creates the database and table.

    def open_spider(self,spider):
        print("opened")
        try:
            # Connect to (and create if necessary) the douban.db database
            self.con = sqlite3.connect("douban.db")
            self.cursor = self.con.cursor()
            # Create the table (column names must be valid identifiers without spaces)
            self.cursor.execute("create table douban(ranking int(4), movie_name varchar(25), director varchar(30), actors varchar(70), introduction varchar(70), score varchar(8), cover varchar(80))")
            self.opened = True
        except Exception as err:
            print(err)
            self.opened = False

​ close_spider: used to close the spider.

def close_spider(self, spider):
    if self.opened:
        self.con.commit()
        self.con.close()
        self.opened = False

process_item: used to insert data and download pictures.

# Insert the data from the item into the table
self.cursor.execute(
    "insert into douban(ranking, movie_name, director, actors, introduction, score, cover) values (?,?,?,?,?,?,?)",
    (item['rank'], item['movie_name'], item['director'], item['actors'],
     item['brief_introduction'], item['grade'], item['imag']))
# Save the movie cover in the local folder
img = item['imag']
rep = requests.get(img)
print(img)
# Use the movie name as the cover file name
file_name = item['movie_name']
with open("images\\" + file_name + ".jpg", 'wb') as f:
    f.write(rep.content)
print("downloaded " + file_name)
return item

3.2.4 results

Console information:

Database results:
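Since the screenshot is not included, the stored rows can also be inspected with a quick query; this is just a verification sketch that assumes the schema created in open_spider above:

import sqlite3

# Print a few of the stored rows (column names follow the schema created in open_spider).
con = sqlite3.connect("douban.db")
for row in con.execute("select ranking, movie_name, score from douban order by ranking limit 5"):
    print(row)
con.close()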

Pictures in the local images folder:

3.3 complete code

https://gitee.com/q_kj/crawl_project/tree/master/douban

3.4 summary

Because my earlier code kept failing, I re-ran the crawl many times. Eventually Douban reported that abnormal requests had been detected from my IP and asked me to log in. After logging in there was no visible warning, but the log still showed crawled (403) <GET https://movie.douban.com/top250> (referer: None). Following the method in https://www.cnblogs.com/guanguan-com/p/13540188.html solved the problem. The lesson when crawling: first crawl a small part of the page and verify correctness, and only then run the full crawl with a waiting time set and the request rate limited.
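One way to apply the "set a waiting time and limit the speed" advice in scrapy is through settings.py; the sketch below shows the relevant built-in settings with example values (not the values used in this project):

# settings.py (sketch): slow the crawl down and send a browser-like User-Agent
DOWNLOAD_DELAY = 3            # seconds to wait between requests
CONCURRENT_REQUESTS = 2       # keep concurrency low
AUTOTHROTTLE_ENABLED = True   # let scrapy adapt the delay automatically
USER_AGENT = "Mozilla/5.0"    # identify as a normal browser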
