Python crawler tutorial - a distributed Sina Weibo crawler

Crawler functionality:

This project refactors a single-machine Sina Weibo crawler into a distributed one.

The Master machine handles only task scheduling and never crawls data itself; each Slave machine simply hands every new Request over to the Master, then fetches a Request back from the Master whenever it needs one to crawl.

Environment and Architecture:

Development language: Python 2.7

Development environment: 64-bit Windows 8, 4 GB RAM, Intel i7-3612QM processor.

Database: MongoDB 3.2.0, Redis 3.0.501

(Python editor: PyCharm; MongoDB management tool: MongoBooster; Redis management tool: RedisStudio)

The crawler is built on the Scrapy framework, with scrapy-redis and Redis providing the distributed layer.

In the distributed setup, one machine acts as the Master and runs Redis for task scheduling; the other machines act as Slaves and simply take tasks from the Master to crawl. The principle is this: when a Slave encounters a new Request, it does not hand it to its own scheduler but sends it to the Redis database on the Master, and every Request the spider actually crawls is fetched from Redis. When Redis receives a Request, it first checks it against what is already stored (deduplication) and only then queues it, handing it out to whichever Slave next asks for work. This is how the tasks are coordinated.
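
To make this concrete, here is a minimal sketch of a scrapy-redis spider on the Slave side; the class name, spider name, and Redis key are illustrative assumptions, not the project's actual code. Instead of a local start_urls list, the spider blocks on a Redis list for its start requests, and every Request it yields is routed back through the Redis scheduler on the Master.

# Minimal scrapy-redis spider sketch (names are assumptions).
from scrapy_redis.spiders import RedisSpider


class WeiboSpider(RedisSpider):
    name = 'weibo'
    # Slaves block on this Redis list on the Master; pushing a URL to it
    # kicks off the crawl, e.g.: redis-cli lpush weibo:start_urls <url>
    redis_key = 'weibo:start_urls'

    def parse(self, response):
        # Any Request yielded here is serialized into Redis on the Master
        # instead of the local scheduler; that is what coordinates the Slaves.
        pass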

Instructions:

Python needs the Scrapy, scrapy-redis, pymongo, and requests packages installed (json and base64 ship with the standard library).

The Master machine only needs Redis installed (give it plenty of memory). Each Slave machine needs the Python environment plus MongoDB to store the crawled data. If you want all the data stored on one machine, just change the MongoDB IP address in the pipeline (see the sketch below), or better yet, build a MongoDB cluster. Both Redis and MongoDB work with no configuration after installation.
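
For reference, a hedged sketch of what such a MongoDB pipeline looks like with pymongo; the host, database, and collection names here are assumptions to adapt.

# Sketch of a Scrapy item pipeline writing to MongoDB (names are assumptions).
import pymongo


class MongoDBPipeline(object):
    def open_spider(self, spider):
        # Change the host here to centralize storage on one machine,
        # or pass a replica-set / cluster URI instead.
        self.client = pymongo.MongoClient('localhost', 27017)
        self.db = self.client['Sina']

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db['weibo'].insert_one(dict(item))
        return item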

Add the Weibo accounts and passwords you use to log in to the cookies.py file, which already contains two accounts as a format reference.

You can modify the settings in the project's settings file, such as the crawl interval, log level, Redis IP, and so on; a sketch of the relevant entries follows.
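
A possible settings.py excerpt, assuming the standard scrapy-redis wiring (the values shown are illustrative, not the project's):

# settings.py excerpt: scrapy-redis wiring plus the knobs mentioned above.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # queue Requests in Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # dedupe via Redis
SCHEDULER_PERSIST = True        # keep the queue and dedupe set across restarts

REDIS_HOST = '192.168.0.100'    # the Master's IP (assumption)
REDIS_PORT = 6379

DOWNLOAD_DELAY = 3              # crawl interval; raise it if you hit 302 redirects
LOG_LEVEL = 'INFO'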

Run Begin.py once the above is configured. To repeat: the Master machine runs no crawler program; its role is task scheduling via Redis. Each Slave runs the crawler, and adding another Slave takes nothing more than setting up the Python environment and MongoDB, copying the code over, and running it.
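
Begin.py is presumably just a launcher script along these lines (the spider name 'weibo' is an assumption):

# encoding=utf-8
# Start the spider from a script rather than from the scrapy command line.
from scrapy import cmdline

cmdline.execute('scrapy crawl weibo'.split())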

Project source code (cookies.py):

# encoding=utf-8
import json
import base64
import requests

"""
Enter your Weibo accounts and passwords here; they can be bought on Taobao,
seven for one yuan. Buying a few dozen is recommended: Weibo's anti-scraping
is aggressive, and if requests are too frequent you will get 302 redirects.
Alternatively, increase the crawl interval.
"""
myWeiBo = [
    {'no': 'jiadieyuso3319@163.com', 'psw': 'a123456'},
    {'no': 'shudieful3618@163.com', 'psw': 'a123456'},
]


def getCookies(weibo):
    """ Log in with each account and collect its cookies """
    cookies = []
    loginURL = r'https://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.15)'
    for elem in weibo:
        account = elem['no']
        password = elem['psw']
        # The SSO endpoint expects the username base64-encoded.
        username = base64.b64encode(account.encode('utf-8')).decode('utf-8')
        postData = {
            "entry": "sso",
            "gateway": "1",
            "from": "null",
            "savestate": "30",
            "useticket": "0",
            "pagerefer": "",
            "vsnf": "1",
            "su": username,
            "service": "sso",
            "sp": password,
            "sr": "1440*900",
            "encoding": "UTF-8",
            "cdult": "3",
            "domain": "sina.com.cn",
            "prelt": "0",
            "returntype": "TEXT",
        }
        session = requests.Session()
        r = session.post(loginURL, data=postData)
        # The response body is GBK-encoded JSON.
        jsonStr = r.content.decode('gbk')
        info = json.loads(jsonStr)
        if info["retcode"] == "0":
            print "Get Cookie Success!( Account:%s )" % account
            cookie = session.cookies.get_dict()
            cookies.append(cookie)
        else:
            print "Failed!( Reason:%s )" % info['reason']
    return cookies


cookies = getCookies(myWeiBo)
print "Get Cookies Finish!( Num:%d)" % len(cookies)
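
One common way to consume this cookie pool, sketched here as a possible downloader middleware (not necessarily the project's actual one): rotate accounts across requests so that no single account gets throttled into 302 redirects.

# Sketch of a downloader middleware that rotates the cookie pool (illustrative).
import random

from cookies import cookies  # the list built by the script above


class CookiesMiddleware(object):
    def process_request(self, request, spider):
        # Attach a randomly chosen logged-in account's cookies to each request.
        request.cookies = random.choice(cookies)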
