
PERF: Optimize read_excel nrows #46894

Merged: 15 commits merged into pandas-dev:main on Jun 5, 2022

Conversation

@ahawryluk (Contributor)

I timed read_excel on a file with 10 columns and 1000 rows, and report the best of 10 repeats (ms) below. This change can make a modest improvement on xls and ods files, and a significant improvement on xlsx and xlsb files. When nrows=None this has no measurable impact on the run time.

ext    nrows   time, main (ms)   time, this branch (ms)
xls    None    22.4              22.1
xls    10      21.4              17.0
xlsx   None    99.1              99.7
xlsx   10      98.0               8.8
xlsb   None    81.0              80.2
xlsb   10      80.2               4.8
ods    None    571               569
ods    10      566               517
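The timing script itself isn't included in the PR; numbers like these can be collected with a sketch along the following lines (the file name and repeat count are placeholders, not taken from the PR):

    import timeit
    import pandas as pd

    # Hypothetical benchmark: time read_excel with and without an nrows cap on
    # the same spreadsheet and keep the best of 10 repeats, as in the table above.
    for nrows in (None, 10):
        best = min(
            timeit.repeat(
                lambda: pd.read_excel("benchmark_10cols_1000rows.xlsx", nrows=nrows),
                repeat=10,
                number=1,
            )
        )
        print(f"nrows={nrows}: {best * 1000:.1f} ms")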

Here are the results of asv run -e -E existing --bench ReadExcel, which show a similar pattern (the benchmark spreadsheet is different from the one above).

[75.00%] ··· io.excel.ReadExcel.time_read_excel                              ok
[75.00%] ··· ========== ============
               engine               
             ---------- ------------
                xlrd     36.5±0.3ms 
              openpyxl   162±0.1ms  
                odf       688±5ms   
             ========== ============

[100.00%] ··· io.excel.ReadExcelNRows.time_read_excel                         ok
[100.00%] ··· ========== ============
                engine               
              ---------- ------------
                 xlrd     24.5±0.1ms 
               openpyxl   29.0±0.1ms 
                 odf       508±3ms   
              ========== ============

@rhshadrach (Member) left a comment

Nice gain, especially in xlsx (I think that's pretty common). Do you know why you don't see similar gains in ods/xls?

Resolved review threads: pandas/tests/io/excel/test_readers.py; pandas/io/excel/_base.py (four threads)
    header_rows += 1
if skiprows is None:
    return header_rows + nrows
if isinstance(skiprows, int):
Member review comment: is_integer again

pandas/io/excel/_base.py (resolved thread)
@rhshadrach added the Performance (Memory or execution speed performance) and IO Excel (read_excel, to_excel) labels on Apr 29, 2022
@jreback added this to the 1.5 milestone on Apr 29, 2022
@jreback (Contributor) left a comment

Please try to reuse as much of the parser validation code as possible.

pandas/io/excel/_base.py (resolved, outdated thread)
@ahawryluk (Contributor, Author)

> Nice gain, especially in xlsx (I think that's pretty common). Do you know why you don't see similar gains in ods/xls?

Thanks! In the case of xls, xlrd parses all of the data on first read, so there's no saving in I/O time, just a small saving inside get_sheet_data. On the other hand, I was surprised that ods showed so little improvement, especially since ods and xlsx are both zip files containing xml, as far as I understand. I'm also not sure why ods is so much slower than xlsx in the first place.

Thanks for the great recommendations on this PR. I'll take a look at them in the coming days.
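For readers who haven't opened the diff: the core idea of the change, shown here as a simplified hypothetical sketch rather than the merged code, is to compute an upper bound on the number of physical rows needed from header, skiprows, and nrows, and let each engine's row-reading loop stop once that many rows have been collected.

    # Illustrative sketch only; names, signatures, and edge-case handling are
    # simplified and do not match the merged pandas code exactly.
    def rows_needed(header, skiprows, nrows):
        """Upper bound on file rows required to honour nrows, or None for all rows."""
        if nrows is None:
            return None  # caller wants everything, nothing to limit
        header_rows = 1 if header is None else header + 1
        if skiprows is None:
            return header_rows + nrows
        if isinstance(skiprows, int):
            return header_rows + nrows + skiprows
        return None  # callables/sequences are harder to bound, so read all rows

    def get_sheet_data(sheet_rows, file_rows_needed=None):
        """Engine-side loop that can stop early when a row cap is supplied."""
        data = []
        for i, row in enumerate(sheet_rows):
            if file_rows_needed is not None and i >= file_rows_needed:
                break
            data.append(row)
        return data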

@ahawryluk (Contributor, Author)

@rhshadrach I've made the suggested changes and they were all big improvements, except maybe the type hints. I have one mypy error that I'm not sure how to fix. Also, in my last commit I added a number of assert isinstance statements to work around the fact that mypy doesn't understand the type narrowing done by e.g. is_integer. There were already a few similar assert statements in _base.py, but I'm not convinced that my most recent commit was an improvement: it seems I've lost readability, conciseness, and the original benefit of is_integer over isinstance(..., int). Maybe I should use the type hint Any in validate_header_arg? I'm open to any suggestions you have. (No rush at all.) Have a great weekend.

if skiprows is None:
    return header_rows + nrows
if is_integer(skiprows):
    assert isinstance(skiprows, int)
Member review comment:

In the future we'll be able to have is_integer be a typeguard to narrow down automatically, but for now you can either use # type: ignore[reason] comments or var = cast(type, var) to satisfy mypy. I'd recommend the latter here: skiprows = cast(int, skiprows). Similarly for the other blocks.
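To make the suggestion concrete, here is a minimal sketch of the cast approach; the surrounding skiprows logic is simplified and not the exact code from this PR:

    from __future__ import annotations

    from typing import cast

    from pandas.api.types import is_integer

    def add_skiprows(header_rows: int, nrows: int, skiprows: int | None) -> int:
        if skiprows is None:
            return header_rows + nrows
        if is_integer(skiprows):
            # mypy does not yet narrow the type from is_integer, so cast explicitly;
            # the alternative is a "# type: ignore[...]" comment on the next line.
            skiprows = cast(int, skiprows)
            return header_rows + nrows + skiprows
        raise TypeError("skiprows must be an integer or None")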

@rhshadrach (Member) left a comment

Thanks for the changes; looks really good. I like the use of is_list_like, but I think we should exclude sets here; can you pass allow_sets=False?
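For context on why sets are worth excluding: is_list_like counts them as list-like by default, but a set has no defined order, which matters for positional arguments such as header and index_col. A quick illustration:

    from pandas.api.types import is_list_like

    print(is_list_like({0, 1}))                    # True: sets are list-like by default
    print(is_list_like({0, 1}, allow_sets=False))  # False: sets are excluded
    print(is_list_like([0, 1], allow_sets=False))  # True: ordered sequences still pass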

Comment on removed lines 123 to 129:
if isinstance(self.header, (list, tuple, np.ndarray)):
    if not all(map(is_integer, self.header)):
        raise ValueError("header must be integer or list of integers")
    if any(i < 0 for i in self.header):
        raise ValueError(
            "cannot specify multi-index header with negative integers"
        )
Member review comment:

I might be missing it; is validate_header_arg called somewhere instead?

@ahawryluk (Contributor, Author) replied:

Yes, but it took me a while to find it. It's called earlier in TextFileReader:

  pandas/io/parsers/readers.py(1847)TextParser()
-> return TextFileReader(*args, **kwds)

  pandas/io/parsers/readers.py(1412)__init__()
-> self.options, self.engine = self._clean_options(options, engine)

  pandas/io/parsers/readers.py(1607)_clean_options()
-> validate_header_arg(options["header"])

Sets for skiprows currently work, so I've left that as is.
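As a usage note on that last point, a hypothetical example of passing a set for skiprows (the file name is a placeholder):

    import pandas as pd

    # Skip the rows at positions 1 and 3; per the comment above, a set is
    # currently accepted here the same way a list would be.
    df = pd.read_excel("example.xlsx", skiprows={1, 3})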
@ahawryluk (Contributor, Author)

@rhshadrach I think this is ready for review. It didn't pass all tests, but the failures were different on the last two runs, so I suspect they aren't caused by this PR. Of course, let me know if I'm wrong about that.

@rhshadrach (Member) left a comment

Just a few more allow_sets=False, otherwise looks good.

  if not (
-     is_sequence
+     is_list_like(self.index_col)
Member review comment: allow_sets=False

    raise ValueError(
        "cannot specify multi-index header with negative integers"
    )
if is_list_like(self.header):
Member review comment: allow_sets=False

@rhshadrach (Member) left a comment

lgtm, @jreback for review

@jreback merged commit 9e10206 into pandas-dev:main on Jun 5, 2022
@jreback (Contributor) commented on Jun 5, 2022

thanks @ahawryluk

Labels: IO Excel (read_excel, to_excel), Performance (Memory or execution speed performance)
Successfully merging this pull request may close this issue: read_excel optimize nrows