[theory combined with practice] an analysis of the reading speed of xlrd, openpyxl and xlwings

Because pneumonia is closed at home, there is no school opening, and the internship has not started yet, so I will revie...

Because pneumonia is closed at home, there is no school opening, and the internship has not started yet, so I will review some small algorithms of data analysis in the internship practice last year, and then one of them attracted my attention. This algorithm is mainly for the data analysis of an ID information in the company. The problem is as follows:
Select a node in a tree (non binary tree), take this node as the root node, and output all the node values on this subtree (regardless of order).

  1. Data situation
    An excel table with two columns and 631040 rows of data. The first column is all nodes, and the second column is the parent node of the previous column. A total of 11.1MB data

  2. language
    python

In fact, the algorithm is very simple. It is processed according to the depth first algorithm. When reading data, I use openpyxl to read two columns of data into two parent-child lists for query. The code of openpyxl is as follows:

# Read with openpyxl childlist = [] fatherlist = [] excel = openpyxl.load_workbook(fileexcel) table = excel['Sheet1'] rownum = table.max_row for i in range(2,rownum + 1): childnum = table.cell(row = i, column = 1).value fathernum = table.cell(row = i, column = 2).value childlist.append(childnum) fatherlist.append(fathernum)

In fact, it's a process of traversal and acquisition, and then I try to use xlwings and xlrd to read data these days

# Whole column reading with xlwings app = xw.App(visible=False, add_book=False) excel = app.books.open(fileexcel) table = excel.sheets[0] print(table) rownum = table.api.UsedRange.Rows.count print(rownum) childlist = table.range((1, 1), (rownum, 1)).value fatherlist = table.range((1, 2), (rownum, 2)).value
# Traversal reading with xlwings childlist = [] fatherlist = [] excel = openpyxl.load_workbook(fileexcel) table = excel.sheets[0] rownum = table.api.UsedRange.Rows.count for i in range(2,rownum + 1): childnum = table.cell(row = i, column = 1).value fathernum = table.cell(row = i, column = 2).value childlist.append(childnum) fatherlist.append(fathernum)
# Traversal reading with xlrd childlist = [] fatherlist = [] excel = xlrd.open_workbook(fileexcel) table = excel.sheet_by_index(0) rownum = table.nrows for i in range(1, rownum): childnum = table.cell(i, 0).value fathernum = table.cell(i, 1).value childlist.append(childnum) fatherlist.append(fathernum)
# Read with xlrd whole column excel = xlrd.open_workbook(fileexcel) table = excel.sheet_by_index(0) rownum = table.nrows childlist = table.col_values(0) fatherlist = table.col_values(1)

I also recorded the time by the way, and the result was as follows:

Reading method Read time (seconds) openpyxl read (traversal) 36.377 Read with xlwings (whole column) 8.129 Read (traverse) with xlwings Not finished after 300 seconds Read with xlrd (whole column) 24.923 Read (traverse) with xlrd 26.262

Excluding time floating, it is found that the whole column of xlwings reads very fast, but the traversal speed is quite slow. The overall situation of openpyxl is a little slower. The xlrd reading is in the middle stream, but the reading speed of the whole column and traversal is not much different. I think this should be written differently. The processing methods should be traversal. Based on the relationship between xlwings and vba, I think Maybe vba processed it when reading. For openpyxl, I looked up the information on the Internet, but I couldn't find the code writing method for reading the whole column, so I couldn't test it.

(ps:python Xiaobai, this article is purely based on personal opinions. If you have better operation methods, please correct them, thank you.)

Eberly Published 1 original article, praised 0, visited 8 Private letter follow

15 February 2020, 03:10 | Views: 6572

Add new comment

For adding a comment, please log in
or create account

0 comments