Pi Pi Hui's web crawler learning notes

Crawling product review data from Tmall

Essential knowledge

  • URL (Uniform / Universal Resource Locator): the uniform resource locator, an identification scheme that completely describes the address of a web page or any other resource on the Internet.
  • Host: specifies the host IP and port number of the requested resource, i.e. the location of the origin server or gateway of the requested URL. From HTTP/1.1 onward, every request must include this header.
  • Cookies: maintain the current session. The server identifies us through cookies and, on finding a logged-in session, returns the page content that is only visible after login.
  • Referer: identifies which page this request was sent from.
  • User-Agent: lets the server identify the operating system, browser, and version information of the client. Adding this header to a crawler disguises it as a browser; without it, the request is likely to be identified as a crawler (see the example after this list).
  • Content-Type: the Internet media type; for example, text/html stands for HTML, image/gif for a GIF image, and application/json for JSON.
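  • As a minimal sketch (the full crawler code comes further below), here is how these headers can be attached to a request with the requests library; all header values are placeholders, not working credentials. Note that requests fills in Host automatically from the URL, so it normally does not need to be set by hand.
# Minimal sketch: attaching the headers described above to a request.
# All header values here are placeholders for illustration only.
import requests

headers = {
    'cookie': 'your_cookie_here',  # placeholder; paste your own session cookie
    'referer': 'https://detail.tmall.com/item.htm',  # the page this request "comes from"
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',  # disguise as a browser
}
response = requests.get('https://rate.tmall.com/list_detail_rate.htm', headers=headers)
print(response.headers.get('Content-Type'))  # media type of the response, e.g. application/json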

The process of crawling Tmall product reviews

  • First, find the product whose reviews you want to crawl.
  • Then open the browser's inspect view. The browser I use here is Chrome 73; just right-click the page and select Inspect.
  • To crawl multiple pages of comments, we need to find the request the page issues when we click "next page". That response contains the information we need; we can then write regular expressions matching the fields we want and build the crawler around them.
  • In the DevTools Network panel we can see the URL of this request.
  • The request headers there also give us the Referer and User-Agent.
  • The request named list_detail_rate in the Network panel is the one that carries the review data we need.
  • Finally, the code is written out in full below, with comments for each part.
# Import required libraries
import requests
import json
import re

# Module-level list storing the URLs of the target JSONP requests
COMMENT_PAGE_URL = []


# Generate link list
def Get_Url(num):
    COMMENT_PAGE_URL.clear()
    # urlFront ='https://rate.tmall.com/list_detail_rate.htm?itemId=591204720926&spuId=1196687088&sellerId=2057500938&order=3&currentPage=2&append=0&content=1&tagId=&posi=&picture=&groupId=&ua=098%23E1hvpvvfvw6vUvCkvvvvvjiPn2Fvzji2R2SWtj3mPmPwtjl8P2zZzj1PP2MhsjEhdphvmpvZ8vi3OppjoUhCvvswN8iy0YMwzPAQ6DItvpvhvvvvvUhCvvswPv1yMKMwzPQphHurvpvEvCo5vVYSCHDDdphvmQ9ZCQmj3QBBn4hARphvCvvvvvmrvpvEvvQRvy3UvRoi2QhvCvvvMMGCvpvVvvpvvhCvmphvLv99ApvjwYcEKOms5k9vibmXYC97W3dhA8oQrEtlB%2BFy%2BnezrmphQRAn3feAOHFIAXcBKFyK2ixrsj7J%2B3%2BdafmxfBkKNB3rsj7Q%2Bu0ivpvUvvmvRE8X69TEvpvVmvvC9jahKphv8vvvvvCvpvvvvvmm86CvmWZvvUUdphvWvvvv9krvpv3Fvvmm86CvmVWtvpvhvvvvv8wCvvpvvUmm3QhvCvvhvvmrvpvEvvFyvrzavm9VdphvhCQpVUCZxvvC7g0znsBBKaVCvpvLSH1a7z2SznswEjE4GDRi4IkisIhCvvswN8340nMwzPs5OHItvpvhvvvvv86Cvvyvh2%2BHj1GhPDervpvEvv1LCNL6Chi19phvHNlwM7L7qYswM22v7SEL4OVUTGqWgIhCvvswN83KTRMwzPQZ9DuCvpvZz2AufpfNznsGDnrfY%2FjwZr197Ih%3D&needFold=0&_ksTS=1584072063659_1149&callback=jsonp1150'
    urlFront = 'https://rate.tmall.com/list_detail_rate.htm?itemId=591204720926&spuId=1196687088&sellerId=2057500938&order=3&currentPage='
    urlRear ='&append=0&content=1&tagId=&posi=&picture=&groupId=&ua=098%23E1hvpvvfvw6vUvCkvvvvvjiPn2Fvzji2R2SWtj3mPmPwtjl8P2zZzj1PP2MhsjEhdphvmpvZ8vi3OppjoUhCvvswN8iy0YMwzPAQ6DItvpvhvvvvvUhCvvswPv1yMKMwzPQphHurvpvEvCo5vVYSCHDDdphvmQ9ZCQmj3QBBn4hARphvCvvvvvmrvpvEvvQRvy3UvRoi2QhvCvvvMMGCvpvVvvpvvhCvmphvLv99ApvjwYcEKOms5k9vibmXYC97W3dhA8oQrEtlB%2BFy%2BnezrmphQRAn3feAOHFIAXcBKFyK2ixrsj7J%2B3%2BdafmxfBkKNB3rsj7Q%2Bu0ivpvUvvmvRE8X69TEvpvVmvvC9jahKphv8vvvvvCvpvvvvvmm86CvmWZvvUUdphvWvvvv9krvpv3Fvvmm86CvmVWtvpvhvvvvv8wCvvpvvUmm3QhvCvvhvvmrvpvEvvFyvrzavm9VdphvhCQpVUCZxvvC7g0znsBBKaVCvpvLSH1a7z2SznswEjE4GDRi4IkisIhCvvswN8340nMwzPs5OHItvpvhvvvvv86Cvvyvh2%2BHj1GhPDervpvEvv1LCNL6Chi19phvHNlwM7L7qYswM22v7SEL4OVUTGqWgIhCvvswN83KTRMwzPQZ9DuCvpvZz2AufpfNznsGDnrfY%2FjwZr197Ih%3D&needFold=0&_ksTS=1584072063659_1149&callback=jsonp1150'
    for i in range(0, num):
        COMMENT_PAGE_URL.append(urlFront + str(1 + i) + urlRear)


# Get comment data
def GetInfo(num):
    # Define required fields
    nickname = []
    auctionSku = []
    ratecontent = []
    ratedate = []
    # Loop through comments on each page
    for i in range(num):
        # Request headers; without them the server returns an incorrect response
        headers = {
            'cookie':'cna=AZIzFWjRuyICAToTAygn5OEJ; sm4=429006; hng=CN%7Czh-CN%7CCNY%7C156; lid=%E6%98%9F%E8%BE%89%E7%81%BF%E7%83%82%E4%B9%8B%E7%82%8E%E7%84%B1%E7%87%9A; t=1e17c56d1530f801b4c5dd9bc8793aa2; tracknick=%5Cu6',
            'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
            'referer': 'https://detail.tmall.com/item.htm?spm=a220m.1000858.1000725.1.3d067a51ue6QgE&id=591204720926&skuId=4065121357065&areaId=420100&user_id=2057500938&cat_id=50025174&is_b=1&rn=f3dfc9236475de95757ce169d42558a0',
            'accept': '*/*',
            'accept-encoding': 'gzip, deflate, br',
            'accept-language': 'zh-CN,zh;q=0.9'
        }  # Disguise as a browser to prevent garbled content or access failure
        # Parsing JS file content
        print(COMMENT_PAGE_URL[i])
        content = requests.get(COMMENT_PAGE_URL[i], headers=headers).text  # Call the HTTP interface and get its body as text
        print(content)
        # If you don't need some of these fields, just delete the corresponding line; to capture another field, add a new pattern in the same way
        nickname.extend(re.findall('"displayUserNick":"(.*?)"', content))  # Regular expression matching stored in list
        auctionSku.extend(re.findall('"auctionSku":"(.*?)"', content))
        ratecontent.extend(re.findall('"rateContent":"(.*?)"', content))
        ratedate.extend(re.findall('"rateDate":"(.*?)"', content))
    # Write the data to a text file
    with open(r"D:\python\python\taobao_info\TaoBao.txt", 'a+', encoding='UTF-8') as file:
        for i in range(len(nickname)):
            text = ','.join((nickname[i], ratedate[i], auctionSku[i], ratecontent[i])) + '\n'
            file.write(text)
            print(i + 1, ": written successfully")


# Main entry point
if __name__ == "__main__":
    Page_Num = 20
    Get_Url(Page_Num)
    GetInfo(Page_Num)  # crawl the same number of pages as were generated
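  • Since every URL ends with callback=jsonp1150, the server returns JSONP rather than plain JSON. As an alternative to the regular expressions above, one could strip the callback wrapper and parse the body with the json module the code already imports. The sketch below assumes the review entries sit under rateDetail.rateList and carry the same field names the regexes match; that structure is an assumption, so verify it against the printed content.
# Alternative sketch: parse the JSONP body with json instead of regex.
# Assumes the body looks like jsonp1150({...}) and that the entries live
# under data['rateDetail']['rateList'] -- verify against the real response.
import json

def Parse_Jsonp(body):
    start = body.find('(') + 1  # drop the leading "jsonp1150("
    end = body.rfind(')')       # drop the trailing ")"
    data = json.loads(body[start:end])
    for rate in data['rateDetail']['rateList']:
        print(rate['displayUserNick'], rate['rateDate'],
              rate['auctionSku'], rate['rateContent'])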

  • The cookie above is the one from my logged-in session, so replace it with your own cookie when you use the code.
  • There are many ways to view cookies. Here is what I do: type document.cookie in the browser console and press Enter.
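  • Once you have the cookie string, you can either paste it whole into the headers dict as the code above does, or split it into name/value pairs and pass them via the cookies parameter of requests. A minimal sketch with placeholder values:
import requests

# Placeholder pairs taken from document.cookie; replace with your own values
cookies = {'cna': 'your_cna_value', 't': 'your_t_value'}
resp = requests.get('https://rate.tmall.com/list_detail_rate.htm',
                    headers={'user-agent': 'Mozilla/5.0'},
                    cookies=cookies)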

Tags: JSON encoding Python Session

Posted on Mon, 16 Mar 2020 03:03:58 -0400 by JD^