Using Python to collect web quality data into an Excel spreadsheet

As we all know, Python has a wealth of third-party modules, and once you are skilled in using them you can accomplish all kinds of tasks. Before starting to collect data you need a target site; a Python script then probes it and records the quality metrics. Detecting web quality relies on Python's pycurl module, which can report the HTTP status code, DNS resolution time, connection time, total transfer time, downloaded size, HTTP header size, average download speed, and other parameters. From these parameters we can understand the quality of service of the web site and then optimize it.

The collected data is written into an Excel spreadsheet with Python's xlsxwriter module. The idea is to append each run's results to a text file, then create the Excel workbook, write the data into it, and draw a chart. Each subsequent run appends another row to the text file, while the Excel workbook is recreated from scratch and completely overwritten. There are, of course, many other modules for creating Excel files; they are not covered here.

Preparation for the Python script:

  • Download the pycurl module; its installer just needs a double-click.
  • Install xlsxwriter with the pip command, which requires the environment variables (notably PATH) to be configured correctly.

1. Since pycurl is installed directly from its downloaded installer, the procedure is simple and not described here.
2. Install the xlsxwriter module (an Internet connection is required).
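The install itself is a single pip command (this assumes pip is already on the PATH, which is what the environment-variable note above refers to):

pip install xlsxwriter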

3. The data acquisition script is as follows:

# -*- coding: utf-8 -*-
import os,sys
import pycurl
import xlsxwriter

URL="www.baidu.com"            #To detect the url of the target, which target needs to be detected and which one can be changed here
c = pycurl.Curl()                     #Create a curl object
c.setopt(pycurl.URL, URL)             #Define url constants for requests
c.setopt(pycurl.CONNECTTIMEOUT, 10)  #Define the waiting time for a request connection
c.setopt(pycurl.TIMEOUT, 10)          #Define request timeout 
c.setopt(pycurl.NOPROGRESS, 1)       #Shield download progress bar
c.setopt(pycurl.FORBID_REUSE, 1)      #Force disconnection after completing interaction, no reuse
c.setopt(pycurl.MAXREDIRS, 1)         #Maximum number of HTTP redirections specified is 1
c.setopt(pycurl.DNS_CACHE_TIMEOUT, 30)
#Create a file object, opened in 'wb' mode, to store the returned HTTP headers and page content
indexfile = open(os.path.dirname(os.path.realpath(__file__))+"/content.txt","wb")
c.setopt(pycurl.WRITEHEADER, indexfile)  #Direct the returned HTTP headers into indexfile
c.setopt(pycurl.WRITEDATA, indexfile)    #Direct the returned HTML content into indexfile
c.perform()
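#Optionally, perform() can be guarded: it raises pycurl.error on DNS
#failures and timeouts, and a try/except (a sketch, not required for the
#script to work) would let it report the failure instead of exiting with
#a traceback:
#    try:
#        c.perform()
#    except pycurl.error, e:
#        print u"Probe failed: %s" % str(e)
#        indexfile.close()
#        sys.exit(1)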

NAMELOOKUP_TIME = c.getinfo(c.NAMELOOKUP_TIME)  #DNS resolution time
CONNECT_TIME = c.getinfo(c.CONNECT_TIME)        #Connection setup time
TOTAL_TIME = c.getinfo(c.TOTAL_TIME)            #Total transfer time
HTTP_CODE = c.getinfo(c.HTTP_CODE)              #HTTP status code
SIZE_DOWNLOAD = c.getinfo(c.SIZE_DOWNLOAD)      #Downloaded size, in bytes
HEADER_SIZE = c.getinfo(c.HEADER_SIZE)          #HTTP header size, in bytes
SPEED_DOWNLOAD = c.getinfo(c.SPEED_DOWNLOAD)    #Average download speed, in bytes/s

print u"HTTP Status code: %s" %(HTTP_CODE)  #Output status code
print u"DNS Analytical time: %.2f ms" %(NAMELOOKUP_TIME*1000)  #Output DNS parsing time
print u"Establishment of connection time: %.2f ms" %(CONNECT_TIME*1000)  #Output Connection Establishment Time
print u"Total transmission end time: %.2f ms" %(TOTAL_TIME*1000)   #Total Output Transfer End Time
print u"Download Packet Size: %d bytes/s" %(SIZE_DOWNLOAD)  #Output download packet size
print u"HTTP Head size: %d byte" %(HEADER_SIZE)   #Output HTTP header size
print u"Average download speed: %d bytes/s" %(SPEED_DOWNLOAD)  #Output average download speed
indexfile.close()          #Close file
c.close()                #Close the curl object

f = open('chart.txt','a')     #Open chart.txt in append mode
f.write(str(HTTP_CODE)+','+str(NAMELOOKUP_TIME*1000)+','+str(CONNECT_TIME*1000)+','+str(TOTAL_TIME*1000)+','+str(SIZE_DOWNLOAD/1024)+','+str(HEADER_SIZE)+','+str(SPEED_DOWNLOAD/1024)+'\n')   #Append one comma-separated row of the metrics above (download size and speed converted to KB)
f.close()                     #Close chart.txt

workbook = xlsxwriter.Workbook('chart.xlsx')    #Create the Excel file chart.xlsx
worksheet = workbook.add_worksheet()            #Create a worksheet object, named Sheet1 by default
chart = workbook.add_chart({'type': 'column'})  #Create a column chart object

title = [URL, u'HTTP status code', u'DNS resolution time (ms)', u'Connection time (ms)', u'Total transfer time (ms)', u'Download size (KB)', u'HTTP header size (bytes)', u'Average download speed (KB/s)']   #Header row for the data table

format = workbook.add_format()    #Create a format object for the data cells
format.set_border(1)              #Give data cells a 1-pixel border

format_title = workbook.add_format()   #Create a format object for the header row
format_title.set_border(1)             #Give header cells a 1-pixel border
format_title.set_bg_color('#00FF00')   #Set the header background color to green ('#00FF00')

format_title.set_align('center')   #Center the header cell contents
format_title.set_bold()            #Make the header cell contents bold

worksheet.write_row(0, 0,title,format_title)    #Write the title content to the first line

f = open('chart.txt','r')    #Open chart.txt read-only
line = 1                     #Worksheet row index; row 0 holds the header
for i in f:                  #Read the file line by line
    head = [line]            #Probe number, used as the row label in column A
    lineList = i.split(',')  #Split the comma-separated string into a list
    lineList = map(lambda i2: int(float(i2.replace("\n", ''))), lineList)  #Strip the trailing newline and truncate each float to an integer
    lineList = head + lineList                      #Prepend the row label to the metrics
    worksheet.write_row(line, 0, lineList, format)  #Write the row into the Excel sheet
    line += 1

average = [u'average value'] + ['=AVERAGE(%s2:%s%d)' % (col, col, line) for col in 'BCDEFGH']   #Average formulas for columns B-H; the data occupies Excel rows 2 through `line`
worksheet.write_row(line, 0, average, format)    #Write the average row directly below the last data row
f.close()            #Close file

def chart_series(cur_row):     #Add one data row to the chart as a series
    chart.add_series({
        'categories': '=Sheet1!$B$1:$H$1',                    #Use the header cells B1:H1 as the category (X-axis) labels
        'values':     '=Sheet1!$B$'+cur_row+':$H$'+cur_row,   #Plot this row's values from column B to column H
        'line':       {'color': 'black'},                     #Draw the series in black
        'name':       '=Sheet1!$A'+cur_row,                   #Use column A (the probe number) as the legend entry
    })

for row in range(2, line + 1):   #Add one series per data row (Excel rows 2 through the last data row)
    chart_series(str(row))

chart.set_size({'width':876,'height':287})    #Define the width and height of the chart

worksheet.insert_chart(line + 2, 0, chart)    #Insert the chart two rows below the average row
workbook.close()                              #Close the Excel workbook
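
Note that the script uses Python 2 syntax (print statements, u'' string literals, and map() returning a list). Porting it to Python 3 mostly comes down to a few mechanical changes; a sketch of the equivalents:

print("HTTP Status code: %s" % HTTP_CODE)                 #print becomes a function in Python 3
lineList = [int(float(i2.strip())) for i2 in lineList]    #map() returns an iterator in Python 3, so a list comprehension is clearer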

4. After running the script, three files are generated in the directory where the script is located: two txt text files and one Excel file. The script also prints the collected metrics to the terminal.
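The terminal output has the following shape (the numbers here are illustrative placeholders, not real measurements):

HTTP Status code: 200
DNS resolution time: 15.32 ms
Connection time: 41.77 ms
Total transfer time: 123.40 ms
Download size: 2381 bytes
HTTP header size: 944 bytes
Average download speed: 19297 bytes/s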

5. The files generated in the current directory are content.txt (the returned HTTP headers and page content), chart.txt (the accumulated metrics), and chart.xlsx (the spreadsheet with the chart).

Among them, the two txt files only pave the way for the Excel output, so they can be ignored; the data worth examining is in chart.xlsx. After six runs of the script (that is, six probes), the sheet contains six data rows, the average row beneath them, and the column chart.
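Since each run appends one probe to chart.txt and rebuilds chart.xlsx, collecting a series of probes is simply a matter of running the script repeatedly. For unattended collection it could be scheduled with cron; for example (the five-minute interval and the script path below are illustrative):

*/5 * * * * /usr/bin/python /path/to/collect_quality.py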
