[theory combined with practice] an analysis of the reading speed of xlrd, openpyxl and xlwings

Because pneumonia is closed at home, there is no school opening, and the internship has not started yet, so I will review some small algorithms of data analysis in the internship practice last year, and then one of them attracted my attention. This algorithm is mainly for the data analysis of an ID information in the company. The problem is as follows:
Select a node in a tree (non binary tree), take this node as the root node, and output all the node values on this subtree (regardless of order).

  1. Data situation
    An excel table with two columns and 631040 rows of data. The first column is all nodes, and the second column is the parent node of the previous column. A total of 11.1MB data

  2. language
    python

In fact, the algorithm is very simple. It is processed according to the depth first algorithm. When reading data, I use openpyxl to read two columns of data into two parent-child lists for query. The code of openpyxl is as follows:

# Read with openpyxl
childlist = []
fatherlist = []
excel = openpyxl.load_workbook(fileexcel)
table = excel['Sheet1']
rownum = table.max_row
for i in range(2,rownum + 1):
    childnum = table.cell(row = i, column = 1).value
    fathernum = table.cell(row = i, column = 2).value
    childlist.append(childnum)
    fatherlist.append(fathernum)

In fact, it's a process of traversal and acquisition, and then I try to use xlwings and xlrd to read data these days

# Whole column reading with xlwings
app = xw.App(visible=False, add_book=False)
excel = app.books.open(fileexcel)
table = excel.sheets[0]
print(table)
rownum = table.api.UsedRange.Rows.count
print(rownum)
childlist = table.range((1, 1), (rownum, 1)).value
fatherlist = table.range((1, 2), (rownum, 2)).value
# Traversal reading with xlwings
childlist = []
fatherlist = []
excel = openpyxl.load_workbook(fileexcel)
table = excel.sheets[0]
rownum = table.api.UsedRange.Rows.count
for i in range(2,rownum + 1):
    childnum = table.cell(row = i, column = 1).value
    fathernum = table.cell(row = i, column = 2).value
    childlist.append(childnum)
    fatherlist.append(fathernum)
# Traversal reading with xlrd
childlist = []
fatherlist = []
excel = xlrd.open_workbook(fileexcel)
table = excel.sheet_by_index(0)
rownum = table.nrows
for i in range(1, rownum):
	childnum = table.cell(i, 0).value
    fathernum = table.cell(i, 1).value
    childlist.append(childnum)
    fatherlist.append(fathernum)
# Read with xlrd whole column
excel = xlrd.open_workbook(fileexcel)
table = excel.sheet_by_index(0)
rownum = table.nrows
childlist = table.col_values(0)
fatherlist = table.col_values(1)

I also recorded the time by the way, and the result was as follows:

Reading method Read time (seconds)
openpyxl read (traversal) 36.377
Read with xlwings (whole column) 8.129
Read (traverse) with xlwings Not finished after 300 seconds
Read with xlrd (whole column) 24.923
Read (traverse) with xlrd 26.262

Excluding time floating, it is found that the whole column of xlwings reads very fast, but the traversal speed is quite slow. The overall situation of openpyxl is a little slower. The xlrd reading is in the middle stream, but the reading speed of the whole column and traversal is not much different. I think this should be written differently. The processing methods should be traversal. Based on the relationship between xlwings and vba, I think Maybe vba processed it when reading. For openpyxl, I looked up the information on the Internet, but I couldn't find the code writing method for reading the whole column, so I couldn't test it.

(ps:python Xiaobai, this article is purely based on personal opinions. If you have better operation methods, please correct them, thank you.)

Published 1 original article, praised 0, visited 8
Private letter follow

Tags: Excel Python

Posted on Sat, 15 Feb 2020 03:10:56 -0500 by Gary Tactic