# Preface

Toubang ("the top list") is a website that aggregates profiles of live-streaming anchors, with fairly comprehensive content. Live streaming is hot right now, and if you want to collect information on all kinds of popular anchors, a site like this is a convenient single source.

Target website:

`http://www.toubang.tv/baike/list/20.html`

The list pages are paginated, but I couldn't work out the pagination rule — the parameter looks encrypted:

```
http://www.toubang.tv/baike/list/20.html?p=hJvm3qMpTkj7J/RNmtAVNw==
http://www.toubang.tv/baike/list/20.html?p=rjaUfcMsOOYXKBBBp5YUUA==
```

Clearly the `p` parameter encodes the page number, but I couldn't figure out how the string is generated.
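The `==` padding suggests the parameter is Base64-encoded. Decoding one of the values yields exactly 16 bytes, which would be consistent with a single block of a block cipher such as AES — though that is only a guess:

```python
import base64

# The p parameter taken from the first list-page URL above
token = "hJvm3qMpTkj7J/RNmtAVNw=="
raw = base64.b64decode(token)
print(len(raw))  # 16 — one AES-sized block, hinting at an encrypted page number
```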

I didn't research it any further — brute force it is!

Instead, I traverse the list pages directly and collect every pagination link they contain, using a simple recursive function.

The accumulated list-page URLs are deduplicated by converting the list to a set with `set()`.

The recursive code:

```python
def get_apgeurls(apgeurls):
    page_urls = []
    for apgeurl in apgeurls:
        page_url = get_pageurl(apgeurl)  # pagination links found on one page
        page_urls.extend(page_url)

    page_urls = set(page_urls)  # deduplicate
    #print(len(page_urls))

    if len(page_urls) < 66:
        return get_apgeurls(page_urls)  # recurse until no new pages turn up
    else:
        return page_urls  # return the full set (returning page_url here was a bug)
```

Fortunately there aren't many pages, so this rather clumsy approach works. Note the use of `return`: when a recursive function calls itself, forgetting to `return` the recursive call makes the outer call return `None`. I only sorted this out after searching Baidu for related information.
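The pitfall can be shown with a minimal, hypothetical `descend` function: if the value of the recursive call is not returned, the outermost call returns `None`:

```python
def descend(n):
    if n > 0:
        descend(n - 1)        # result of the recursive call is discarded
    else:
        return "reached the bottom"

def descend_fixed(n):
    if n > 0:
        return descend_fixed(n - 1)  # propagate the result back up the stack
    else:
        return "reached the bottom"

print(descend(3))        # None
print(descend_fixed(3))  # reached the bottom
```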

The rest is routine scraping, so I won't explain it step by step.

```python
def get_urllists(urls):
    for url in urls:
        # one thread per list page (get_urllist is defined in the full source below)
        i = threading.Thread(target=get_urllist, args=(url,))
        i.start()
        i.join()
```

Note that when the thread target takes a parameter, it must be passed as `args=(url,)`. Even with multithreading, scraping kept raising errors — most of the time the server simply stopped responding. For large-scale crawling you would really want the Scrapy framework instead.
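Incidentally, calling `start()` followed immediately by `join()` inside the loop runs the threads one at a time. To actually overlap the work, start all threads first and join them afterwards. A sketch with a placeholder `fetch` function standing in for the real request (the URLs here are made up for illustration):

```python
import threading

results = []
lock = threading.Lock()

def fetch(url):
    # placeholder for the real request; just record which URL was handled
    with lock:
        results.append(url)

urls = [f"http://www.toubang.tv/baike/list/{i}.html" for i in range(5)]
threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]  # args=(url,)
for t in threads:
    t.start()   # launch every thread so they can run concurrently
for t in threads:
    t.join()    # then wait for all of them to finish
print(len(results))  # 5
```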

Sample run output:

The full source code is attached for reference:

```python
from fake_useragent import UserAgent
import requests, time, os
import threading
from lxml import etree

def ua():
    # Build request headers with a random User-Agent
    return {'User-Agent': UserAgent().random}

def get_pageurl(url):
    # Collect the pagination links that appear on one list page
    pageurl = []
    time.sleep(1)
    html = requests.get(url, headers=ua(), timeout=10).text
    req = etree.HTML(html)
    pagelists = req.xpath('//div[@class="row-page tc"]/a/@href')
    for pagelist in pagelists:
        if "baike" in pagelist:
            pagelist = f"http://www.toubang.tv{pagelist}"
            pageurl.append(pagelist)
    #print(len(pageurl))
    return pageurl

def get_apgeurls(apgeurls):
    # Recursively expand the set of known list pages until no new ones appear
    page_urls = []
    for apgeurl in apgeurls:
        page_url = get_pageurl(apgeurl)
        page_urls.extend(page_url)

    page_urls = set(page_urls)  # deduplicate
    #print(len(page_urls))

    if len(page_urls) < 5:      # small threshold for testing
    #if len(page_urls) < 65:    # threshold for the full site
        return get_apgeurls(page_urls)  # recurse; remember to return the result
    else:
        return page_urls

def get_urllist(url):
    # Extract the anchor detail-page links from one list page
    time.sleep(1)
    html = requests.get(url, headers=ua(), timeout=10).text
    req = etree.HTML(html)
    hrefs = req.xpath('//div[@class="h5 ellipsis"]/a/@href')
    print(hrefs)
    for href in hrefs:
        href = f'http://www.toubang.tv{href}'
        get_info(href)

def get_urllists(urls):
    # One thread per list page (start()+join() in the loop runs them serially)
    for url in urls:
        i = threading.Thread(target=get_urllist, args=(url,))
        i.start()
        i.join()

def get_info(url):
    # Scrape one anchor's detail page: name, tags, profile image, stats, bio
    time.sleep(1)
    html = requests.get(url, headers=ua(), timeout=10).text
    req = etree.HTML(html)
    name = req.xpath('//div[@class="h3 ellipsis"]/span[@class="title"]/text()')[0]
    os.makedirs(f'{name}/', exist_ok=True)  # create a directory per anchor
    briefs = req.xpath('//dl[@class="game-tag clearfix"]/dd/span//text()')
    brief_img = req.xpath('//div[@class="i-img fl mr20"]/img/@src')[0].split('=')[1]
    print(name)
    print(briefs)
    print(brief_img)
    down_img(brief_img, name)
    informations = req.xpath('//table[@class="table-fixed table-hover hot-search-play"]/tbody/tr[@class="baike-bar"]/td//text()')
    for information in informations:
        # skip entries that are only layout whitespace
        if not any(c in information for c in '\r\n\t'):
            print(information)

    text = req.xpath('//div[@class="text-d"]/p//text()')
    print(text)
    text_imgs = req.xpath('//div[@id="wrapBox1"]/ul[@id="count1"]/li/a[@class="img_wrap"]/@href')
    print(text_imgs)
    for text_img in text_imgs:
        # download each gallery image in its own thread
        i = threading.Thread(target=down_img, args=(text_img, name))
        i.start()
        i.join()

def down_img(img_url, name):
    # Download one image into the anchor's directory
    img_name = img_url.split('/')[-1]
    time.sleep(2)
    r = requests.get(img_url, headers=ua(), timeout=10)
    with open(f'{name}/{img_name}', 'wb') as f:
        f.write(r.content)
    print(f'>>> Saved image {img_name}!')

def main():
    url = "http://www.toubang.tv/baike/list/20.html?p=hJvm3qMpTkjm8Rev+NDBTw=="
    apgeurls = [url]
    page_urls = get_apgeurls(apgeurls)
    print(page_urls)
    get_urllists(page_urls)

if __name__ == '__main__':
    main()
```

Tags: Python

Posted on Sat, 13 Jun 2020 05:16:56 -0400 by pushpendra.php