Python multithreaded crawler in practice: collecting profile information on popular streamers

 

Preface

The text and images in this article come from the Internet and are for learning and exchange only, not for any commercial purpose. Copyright belongs to the original authors. If you have any concerns, please contact us promptly so we can handle them.

Toubang ("top list") is a website that aggregates streamer profiles, and its coverage is fairly comprehensive. Live streaming is booming right now, so if you want to look up all kinds of streamer details, a site like this is a good place to collect information on popular streamers.

 

Target website:

http://www.toubang.tv/baike/list/20.html

 

 

Looking at the list pages, I couldn't find an obvious pagination rule. Is the page parameter encrypted?

http://www.toubang.tv/baike/list/20.html?p=hJvm3qMpTkj7J/RNmtAVNw==
http://www.toubang.tv/baike/list/20.html?p=rjaUfcMsOOYXKBBBp5YUUA==

 

Obviously the value of the p parameter represents the page number, but I couldn't work out how the string is generated.

I didn't dig into it any further; brute force it is!

Instead, I simply traverse all the list pages and collect the page links, using a recursive function.

This gathers the full set of list pages. For de-duplication, the list is converted directly to a set with set().

Recursive code

def get_apgeurls(apgeurls):
    page_urls=[]
    for apgeurl in apgeurls:
        page_url=get_pageurl(apgeurl)   # pagination links found on this page
        page_urls.extend(page_url)

    page_urls=set(page_urls)            # de-duplicate with set()
    #print(len(page_urls))

    if len(page_urls) < 66:
        return get_apgeurls(page_urls)  # recurse until all list pages have been found
    else:
        return page_urls                # return the de-duplicated set of list-page URLs

 

Fortunately there are not many pages, so this rather crude approach works. Pay attention to the use of return: the recursive call must itself be returned, otherwise the outer call returns None. I found the fix by searching Baidu for related information.
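To make that pitfall concrete, here is a minimal standalone sketch (not part of the crawler; countdown_wrong and countdown_right are hypothetical names). The first version drops the result of the recursive call, so the outer call evaluates to None; the second returns it, so the value propagates back up the call chain.

# Hypothetical illustration of the recursion/return pitfall
def countdown_wrong(n):
    if n > 0:
        countdown_wrong(n - 1)        # result of the recursive call is discarded
    else:
        return 'done'                 # only the innermost frame returns a value

def countdown_right(n):
    if n > 0:
        return countdown_right(n - 1) # propagate the result back to the original caller
    return 'done'

print(countdown_wrong(3))  # None
print(countdown_right(3))  # done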

The rest is routine scraping, so I won't go over it here.

One thing worth mentioning is the multithreading:

def get_urllists(urls):
    threads = []
    for url in urls:
        t=threading.Thread(target=get_urllist,args=(url,))  # one thread per list page
        threads.append(t)
    for i in threads:
        i.start()
    for i in threads:
        i.join()   # wait for every thread to finish

    print('>>> Link collection complete!')

 

Note that when passing a single argument to a thread, it must be written as a one-element tuple: args=(url,). Running this many threads at once makes the crawl error-prone; the server often simply fails to respond. For large-scale crawling, the Scrapy framework would probably be a better fit.
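If the server keeps failing to respond, one alternative worth trying (not in the original code) is to cap the number of concurrent requests with concurrent.futures.ThreadPoolExecutor instead of starting one thread per URL. A minimal sketch, assuming the same get_urllist function as above and a hypothetical limit of 5 workers:

from concurrent.futures import ThreadPoolExecutor, as_completed

def get_urllists_pooled(urls, max_workers=5):
    # max_workers is a hypothetical cap; tune it to what the server tolerates
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(get_urllist, url): url for url in urls}
        for future in as_completed(futures):
            try:
                future.result()   # re-raises any exception from the worker thread
            except Exception as e:
                print(f'>>> Failed on {futures[future]}: {e}')
    print('>>> Link collection complete!')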

Running result:

 

 

The full source code is attached for reference:

from fake_useragent import UserAgent
import requests,time,os
from lxml import etree
import threading  #Multithreading


def ua():
    # return a random User-Agent header for each request
    ua=UserAgent()
    headers={'User-Agent':ua.random}
    return headers

def get_pageurl(url):
    # collect pagination links from one list page
    pageurl=[]
    html=requests.get(url,headers=ua()).content.decode('utf-8')
    time.sleep(1)
    req=etree.HTML(html)
    pagelists=req.xpath('//div[@class="row-page tc"]/a/@href')
    for pagelist in pagelists:
        if "baike" in pagelist:
            pagelist=f"http://www.toubang.tv{pagelist}"
            pageurl.append(pagelist)

    #print(len(pageurl))
    return pageurl


def get_apgeurls(apgeurls):
    page_urls=[]
    for apgeurl in apgeurls:
        page_url=get_pageurl(apgeurl)
        page_urls.extend(page_url)

    page_urls=set(page_urls)
    #print(len(page_urls))

    if len(page_urls) < 5:
    #if len(page_urls) < 65:
        return get_apgeurls(page_urls)  # recurse until all list pages have been found
    else:
        return page_urls


def get_urllist(url):
    # collect streamer detail-page links from one list page and scrape each of them
    html = requests.get(url, headers=ua()).content.decode('utf-8')
    time.sleep(1)
    req = etree.HTML(html)
    hrefs=req.xpath('//div[@class="h5 ellipsis"]/a/@href')
    print(hrefs)
    for href in hrefs:
        href=f'http://www.toubang.tv{href}'
        get_info(href)




def get_urllists(urls):
    threads = []
    for url in urls:
        t=threading.Thread(target=get_urllist,args=(url,))
        threads.append(t)
    for i in threads:
        i.start()
    for i in threads:
        i.join()

    print('>>> Link collection complete!')


def get_info(url):
    # scrape one streamer's detail page: name, tags, portrait, info table, intro text and photos
    html = requests.get(url, headers=ua()).content.decode('utf-8')
    time.sleep(1)
    req = etree.HTML(html)
    name=req.xpath('//div[@class="h3 ellipsis"]/span[@class="title"]/text()')[0]
    os.makedirs(f'{name}/', exist_ok=True)  # Create directory
    briefs=req.xpath('//dl[@class="game-tag clearfix"]/dd/span//text()')
    brief_img=req.xpath('//div[@class="i-img fl mr20"]/img/@src')[0].split('=')[1]
    print(name)
    print(briefs)
    print(brief_img)
    down_img(brief_img, name)
    informations=req.xpath('//table[@class="table-fixed table-hover hot-search-play"]/tbody/tr[@class="baike-bar"]/td//text()')
    for information in informations:
        if not any(c in information for c in ('\r', '\n', '\t')):  # keep entries with no layout whitespace
            print(information)

    text=req.xpath('//div[@class="text-d"]/p//text()')
    print(text)
    text_imgs=req.xpath('//div[@id="wrapBox1"]/ul[@id="count1"]/li/a[@class="img_wrap"]/@href')
    print(text_imgs)
    threads=[]
    for text_img in text_imgs:
        t=threading.Thread(target=down_img,args=(text_img,name))
        threads.append(t)

    for i in threads:
        i.start()
    for i in threads:
        i.join()

    print("Picture download complete!")



def down_img(img_url,name):
    # download a single image into the directory named after the streamer
    img_name=img_url.split('/')[-1]
    r=requests.get(img_url,headers=ua(),timeout=8)
    time.sleep(2)
    with open(f'{name}/{img_name}','wb') as f:
        f.write(r.content)
    print(f'>>> Saved image {img_name} successfully!')



def main():
    url = "http://www.toubang.tv/baike/list/20.html?p=hJvm3qMpTkjm8Rev+NDBTw=="
    apgeurls = [url]
    page_urls = get_apgeurls(apgeurls)
    print(page_urls)
    get_urllists(page_urls)



if __name__=='__main__':
    main()

Tags: Python
