The third major assignment
Assignment 1
1.1 experimental topic
- Requirements: specify a website and crawl all the pictures on that website, for example China Meteorological Network (http://www.weather.com.cn). Use single-threaded and multi-threaded crawling respectively. (The number of pictures crawled is limited by the last three digits of the student number.)
- Output information: print the downloaded URL information on the console, store the downloaded images in the images subfolder, and give a screenshot.
1.2 single-threaded approach
1.2.1 get all links on the home page of the weather network
Because the pictures span multiple pages, we first considered paginating through the weather pages of the different regions, but getting each region's code number for pagination is troublesome, and many of the pictures are duplicated across regions.
So in the end we collect the links to other pages from the home page and extract the pictures from those pages.
```python
from lxml import etree

start_url = "http://www.weather.com.cn/"
data = getHTMLText(start_url)
html = etree.HTML(data)
# Find the href attribute of every <a> tag on the home page
urlli = html.xpath("//a/@href")
```
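getHTMLText is a helper that is not shown here; a minimal sketch of what it is assumed to do (a requests call with a timeout and basic error handling; the actual implementation in the repository may differ):

```python
import requests

def getHTMLText(url):
    # Assumed helper: fetch a page and return its text, or "" on failure
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except Exception as err:
        print(err)
        return ""
```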
Because urlli contains entries that are not links to other pages (for example relative paths or javascript pseudo-links), the following code keeps only the entries that start with http://:
```python
urls = []
for url in urlli:
    if url[0:7] == "http://":
        urls.append(url)
```
1.2.2 get picture links from the collected pages
Extract the picture links from each page and skip duplicates. Every time a new picture link appears, call the download method to save the picture.
```python
images = soup.select("img")
for image in images:
    try:
        src = image["src"]
        url = urllib.request.urljoin(start_url, src)
        if url not in urls:
            count += 1
            if count > 15:
                break
            urls.append(url)
            print(url)
            download(url, count)
    except Exception as err:
        print(err)
```
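The soup object above is assumed to have been built from each sub-page beforehand, roughly like this (the use of BeautifulSoup and the reuse of getHTMLText are assumptions):

```python
from bs4 import BeautifulSoup

# Assumed setup for the snippet above: fetch each sub-page and parse it
for page_url in list(urls):      # urls holds the page links collected in 1.2.1
    data = getHTMLText(page_url)
    soup = BeautifulSoup(data, "html.parser")
    # ... the img-extraction loop above then runs for this page
```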
1.2.3 picture download
```python
file_name = url.split('/')[-1]
rep = requests.get(url)
with open("images\\" + file_name, 'wb') as f:
    f.write(rep.content)
print("downloaded " + file_name)
```
1.2.4 results
The console outputs the picture links and a message for each successful download.
Results in the local folder images.
1.3 multithreaded approach
1.3.1 difference from the single-threaded version
The main difference is that child threads must be created, so threads = [] is used to store the threads. The thread-creation code is as follows:
```python
T = threading.Thread(target=download, args=(url,))
T.setDaemon(False)
T.start()
threads.append(T)
```
The rest is consistent with single threading.
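Because the started threads are kept in threads, the main program can wait for all of them to finish before exiting; a minimal sketch of that final step (assuming the download helper above):

```python
# Wait for every download thread to finish before the program exits
for T in threads:
    T.join()
print("all downloads finished")
```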
1.3.2 results
The console outputs the picture links and a message for each successful download.
Results in the local folder images.
1.4 complete code
https://gitee.com/q_kj/crawl_project/tree/master/third
1.5 summary
The main problem at the beginning was how to paginate. Paginating through the regional weather pages is too complex, so in the end the links to other pages are extracted from the main page for crawling. The other parts can be written by adapting the code from class.
Assignment 2
2.1 experimental topic
- Requirements: use the scrapy framework to reproduce Assignment 1.
- Output information: same as Assignment 1.
2.2 ideas
2.2.1 items.py
```python
import scrapy

class WeatherItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    imag = scrapy.Field()
```
imag is the field that stores a picture link.
2.2.2 MySpider.py
The parse function is mainly used to get links to all other pages of the main page.
```python
urls = []
# First find the href attribute of every <a> tag on the home page
urllib = response.xpath("//a/@href").extract()
for url in urllib:
    # Some href values are not real links, so filter them and keep the rest in urls
    if url[0:7] == "http://":
        if url not in urls:
            urls.append(url)
# Call the image crawling function for each link
for url in urls:
    if self.count < 135:
        yield scrapy.Request(url=url, callback=self.imageSpider)
```
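For context, the surrounding spider class is assumed to look roughly like this (the class name, spider name, project module, and start URL are assumptions based on the snippets):

```python
import scrapy
from weather.items import WeatherItem   # project/module name is an assumption

class MySpider(scrapy.Spider):
    name = "MySpider"                    # spider name is an assumption
    start_urls = ["http://www.weather.com.cn/"]
    count = 0                            # picture counter shared by the callbacks

    def parse(self, response):
        ...                              # the snippet above

    def imageSpider(self, response):
        ...                              # the next snippet
```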
The imageSpider function is mainly used to get links to all pictures on the page.
```python
imgurl = []
# Extract the picture link from the src attribute of every <img> tag
img_url = response.xpath("//img//@src").extract()
# if self.count >= 135:
#     return
for url in img_url:
    # Not every src is a full link, so filter
    if url[0:7] == "http://":
        # Exclude duplicate links
        if url not in imgurl:
            self.count += 1
            # Stop collecting once 135 pictures have been reached
            if self.count < 135:
                imgurl.append(url)
for url in imgurl:
    item = WeatherItem()
    item["imag"] = url
    yield item
```
2.2.3 pipelines.py
open_spider, which opens the spider and initializes the counter:
```python
def open_spider(self, spider):
    print("opened")
    self.opened = True
    self.count = 0
```
close_spider, which closes the spider and reports the total:
```python
def close_spider(self, spider):
    if self.opened:
        self.opened = False
        print("closed")
        print("Total crawled", self.count, "pictures")
```
The part above adapts the code from the PPT given by the teacher.
process_item enables image download:
```python
self.count = self.count + 1
img = item["imag"]
print(img)
# if self.count > 135:
#     self.close_spider()
if self.opened:
    try:
        # Download the picture to the local folder
        rep = requests.get(img)
        file_name = img.split('/')[-1]
        with open("images\\" + file_name, 'wb') as f:
            f.write(rep.content)
        print("downloaded " + file_name)
    except Exception as err:
        print(err)
return item
```
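For these open_spider/close_spider/process_item hooks to be invoked, the pipeline class has to be enabled in settings.py; a minimal sketch (the project and pipeline class names are assumptions):

```python
# settings.py (project and pipeline class names are assumptions)
ITEM_PIPELINES = {
    "weather.pipelines.WeatherPipeline": 300,
}
```

The project is then run from its root directory with the usual scrapy crawl command, here assumed to be scrapy crawl MySpider to match the spider name assumed above.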
2.2.4 results
Console information:
Results in the images folder:
2.3 complete code
https://gitee.com/q_kj/crawl_project/tree/master/weather
2.4 summary
The second assignment is very similar to the first; the main work lies in writing the various files of the scrapy framework.
Assignment 3
3.1 experimental topic
- Requirements: crawl the Douban movie data using scrapy and XPath, store the content in the database, and store the pictures in the ./imgs path.
- Candidate site: https://movie.douban.com/top250
- Output information:
| Serial number | Movie title | Director | Performer | Brief introduction | Film rating | Film cover |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | The Shawshank Redemption | Frank Darabont | Tim Robbins | Want to set people free | 9.7 | ./imgs/xsk.jpg |
| 2 | ... | ... | ... | ... | ... | ... |
3.2 ideas
3.2.1 items.py
```python
import scrapy

class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    rank = scrapy.Field()
    movie_name = scrapy.Field()
    director = scrapy.Field()
    actors = scrapy.Field()
    brief_introduction = scrapy.Field()
    grade = scrapy.Field()
    imag = scrapy.Field()
```
3.2.2 spider.py
Extract the contents of each part through xpath:
```python
rank = response.xpath("//div[@class='pic']/em/text()").extract()  # ranking
name = response.xpath("//div[@class='hd']/a/span[1]/text()").extract()  # movie name
dir_acts = response.xpath("//div[@class='bd']/p[1]/text()").extract()  # director and actor information, mixed together and extracted again below
brief_intro = response.xpath("//div[@class='bd']/p[2]/span/text()").extract()  # introduction
grade = response.xpath("//div[@class='bd']/div[@class='star']/span[@class='rating_num']/text()").extract()  # score
img = response.xpath("//div[@class='pic']/a/img/@src").extract()  # movie cover picture link
```
The problem is that the director and starring information obtained above is mixed together with other information (the year and region), as shown in the following figure:
Therefore, it needs to be extracted again:
```python
dname = []  # used to save the directors' names
aname = []  # used to save the leading actors' names
count = 0   # used to track the parity of the entries in dir_acts
# Each movie's dir_acts text comes in two parts: the first holds the director and
# leading actors, the second holds the region and date, so only the odd (first)
# entries are extracted
for dir_act in dir_acts:
    count += 1
    if count % 2 != 0:
        # "\xa0\xa0\xa0" separates the director from the leading actors, so split on it
        dir_act = dir_act.strip().split('\xa0\xa0\xa0')
        # Strip the "导演: " label from the first part and store the name in dname
        dname.append(dir_act[0].split('导演: ')[1])
        # Strip the "主演: " label from the second part and store the names in aname
        aname.append(dir_act[1].split('主演: ')[1])
    else:
        continue
```
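As a concrete check of the splitting logic, a representative odd entry of dir_acts looks roughly like the string below (illustrative, taken from the top of the list):

```python
s = "导演: 弗兰克·德拉邦特 Frank Darabont\xa0\xa0\xa0主演: 蒂姆·罗宾斯 Tim Robbins /..."
parts = s.strip().split('\xa0\xa0\xa0')
print(parts[0].split('导演: ')[1])   # 弗兰克·德拉邦特 Frank Darabont
print(parts[1].split('主演: ')[1])   # 蒂姆·罗宾斯 Tim Robbins /...
```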
Then save the final result into item:
```python
item = DoubanItem()
item["rank"] = int(rank[i])
item["movie_name"] = name[i]
item["director"] = dname[i]
item["actors"] = aname[i]
item["brief_introduction"] = brief_intro[i]
item["grade"] = grade[i]
item["imag"] = img[i]
yield item
```
Finally, turn the page:
```python
# Turn the page: the URL pattern is https://movie.douban.com/top250?start=(?)&filter=,
# where (?) is a multiple of 25
url_f = "https://movie.douban.com/top250"
for i in range(1, 10):
    url = url_f + "?start=" + str(i * 25) + "&filter="
    # Request the next page and call back parse to repeat the operations above
    yield scrapy.Request(url=url, callback=self.parse)
```
3.2.3 pipelines.py
open_spider: open the spider and create the database and table at the same time.
```python
# pipelines.py needs "import sqlite3" (and "import requests" for process_item) at the top
def open_spider(self, spider):
    print("opened")
    try:
        # Connect to (and create, if needed) the douban database
        self.con = sqlite3.connect("douban.db")
        self.cursor = self.con.cursor()
        # Create the table
        self.cursor.execute(
            "create table douban(ranking int(4), movie_name varchar(25),"
            " director varchar(30), actors varchar(70), introduction varchar(70),"
            " score varchar(8), cover varchar(80))")
        self.opened = True
    except Exception as err:
        print(err)
        self.opened = False
```
close_spider: used to close the spider.
```python
def close_spider(self, spider):
    if self.opened:
        self.con.commit()
        self.con.close()
        self.opened = False
```
process_item: used to insert data and download pictures.
```python
# Insert the data from item into the table
self.cursor.execute(
    "insert into douban(ranking, movie_name, director, actors, introduction, score, cover)"
    " values(?,?,?,?,?,?,?)",
    (item['rank'], item['movie_name'], item['director'], item['actors'],
     item['brief_introduction'], item['grade'], item['imag']))
# Save the movie cover to the local folder
rep = requests.get(item["imag"])
print(item["imag"])
# Use the movie name as the cover file name
file_name = item['movie_name']
with open("images\\" + file_name + ".jpg", 'wb') as f:
    f.write(rep.content)
print("downloaded " + file_name)
```
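To check the stored rows afterwards, the database file can be queried directly; a minimal sketch using the table and column names created in open_spider above:

```python
import sqlite3

con = sqlite3.connect("douban.db")
for row in con.execute("select ranking, movie_name, score from douban order by ranking limit 5"):
    print(row)
con.close()
```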
3.2.4 results
Console information:
Database results:
images local folder photo results:
3.3 complete code
https://gitee.com/q_kj/crawl_project/tree/master/douban
3.4 summary
Because the earlier version of the code kept failing to crawl and was therefore run many times, Douban eventually reported that abnormal requests were detected from my IP and asked me to log in. Even after logging in there was no explicit error message, but the log showed crawled (403) <GET https://movie.douban.com/top250> (referer: None). The method in https://www.cnblogs.com/guanguan-com/p/13540188.html solved this. So when crawling, first crawl part of the page content and test its correctness; once everything works, set a waiting time and limit the speed before crawling the live site.
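A minimal sketch of the kind of settings.py throttling referred to here (the values are illustrative, not the ones actually used):

```python
# settings.py -- illustrative throttling settings for polite crawling
DOWNLOAD_DELAY = 2            # wait a couple of seconds between requests
CONCURRENT_REQUESTS = 8       # limit how many requests run in parallel
AUTOTHROTTLE_ENABLED = True   # let scrapy adapt the crawl speed automatically
```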