
PERF: Optimize read_excel nrows #46894

Merged: 15 commits merged into pandas-dev:main on Jun 5, 2022

Conversation

@ahawryluk (Contributor)

I timed read_excel on a file with 10 columns and 1000 rows, and report the best of 10 repeats (ms) below. This change can make a modest improvement on xls and ods files, and a significant improvement on xlsx and xlsb files. When nrows=None this has no measurable impact on the run time.

ext    nrows   time, main (ms)   time, this branch (ms)
xls    None    22.4              22.1
xls    10      21.4              17.0
xlsx   None    99.1              99.7
xlsx   10      98.0               8.8
xlsb   None    81.0              80.2
xlsb   10      80.2               4.8
ods    None    571               569
ods    10      566               517
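The timing script itself isn't included in the PR; numbers like these can be collected with a sketch along the following lines (the file name and repeat count are placeholders, not taken from the PR):

    import timeit
    import pandas as pd

    # Hypothetical benchmark: time read_excel with and without an nrows cap on
    # the same spreadsheet and keep the best of 10 repeats, as in the table above.
    for nrows in (None, 10):
        best = min(
            timeit.repeat(
                lambda: pd.read_excel("benchmark_10cols_1000rows.xlsx", nrows=nrows),
                repeat=10,
                number=1,
            )
        )
        print(f"nrows={nrows}: {best * 1000:.1f} ms")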

Here are the results of asv run -e -E existing --bench ReadExcel, which show a similar pattern (the benchmark spreadsheet is different from the one above).

[75.00%] ··· io.excel.ReadExcel.time_read_excel                              ok
[75.00%] ··· ========== ============
               engine               
             ---------- ------------
                xlrd     36.5±0.3ms 
              openpyxl   162±0.1ms  
                odf       688±5ms   
             ========== ============

[100.00%] ··· io.excel.ReadExcelNRows.time_read_excel                         ok
[100.00%] ··· ========== ============
                engine               
              ---------- ------------
                 xlrd     24.5±0.1ms 
               openpyxl   29.0±0.1ms 
                 odf       508±3ms   
              ========== ============

@rhshadrach (Member) left a comment

Nice gain, especially in xlsx (I think that's pretty common). Do you know why you don't see similar gains in ods/xls?

Resolved review threads: pandas/tests/io/excel/test_readers.py; pandas/io/excel/_base.py (four threads)
    header_rows += 1
if skiprows is None:
    return header_rows + nrows
if isinstance(skiprows, int):
Member review comment: is_integer again

pandas/io/excel/_base.py (resolved thread)
@rhshadrach added the Performance (Memory or execution speed performance) and IO Excel (read_excel, to_excel) labels on Apr 29, 2022
@jreback added this to the 1.5 milestone on Apr 29, 2022
@jreback (Contributor) left a comment

Please try to reuse as much of the parser validation code as possible.

pandas/io/excel/_base.py (resolved, outdated thread)
@ahawryluk (Contributor, Author)

> Nice gain, especially in xlsx (I think that's pretty common). Do you know why you don't see similar gains in ods/xls?

Thanks! In the case of xls, xlrd parses all of the data on first read, so there's no saving in I/O time, just a small saving inside get_sheet_data. On the other hand, I was surprised that ods showed so little improvement, especially since ods and xlsx are both zip files containing xml, as far as I understand. I'm also not sure why ods is so much slower than xlsx in the first place.

Thanks for the great recommendations on this PR. I'll take a look at them in the coming days.
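For readers who haven't opened the diff: the core idea of the change, shown here as a simplified hypothetical sketch rather than the merged code, is to compute an upper bound on the number of physical rows needed from header, skiprows, and nrows, and let each engine's row-reading loop stop once that many rows have been collected.

    # Illustrative sketch only; names, signatures, and edge-case handling are
    # simplified and do not match the merged pandas code exactly.
    def rows_needed(header, skiprows, nrows):
        """Upper bound on file rows required to honour nrows, or None for all rows."""
        if nrows is None:
            return None  # caller wants everything, nothing to limit
        header_rows = 1 if header is None else header + 1
        if skiprows is None:
            return header_rows + nrows
        if isinstance(skiprows, int):
            return header_rows + nrows + skiprows
        return None  # callables/sequences are harder to bound, so read all rows

    def get_sheet_data(sheet_rows, file_rows_needed=None):
        """Engine-side loop that can stop early when a row cap is supplied."""
        data = []
        for i, row in enumerate(sheet_rows):
            if file_rows_needed is not None and i >= file_rows_needed:
                break
            data.append(row)
        return data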

@ahawryluk (Contributor, Author)

@rhshadrach I've made the suggested changes and they were all big improvements, except maybe the type hints. I have one mypy error that I'm not sure how to fix. Also, in my last commit I added a number of assert isinstance statements to work around the fact that mypy doesn't understand the type narrowing done by e.g. is_integer. There were already a few similar assert statements in _base.py, but I'm not convinced that my most recent commit was an improvement: it seems I've lost readability, conciseness, and the original benefit of is_integer over isinstance(..., int). Maybe I should use the type hint Any in validate_header_arg? I'm open to any suggestions you have. (No rush at all.) Have a great weekend.

if skiprows is None:
    return header_rows + nrows
if is_integer(skiprows):
    assert isinstance(skiprows, int)
Member review comment:

In the future we'll be able to have is_integer be a typeguard to narrow down automatically, but for now you can either use # type: ignore[reason] comments or var = cast(type, var) to satisfy mypy. I'd recommend the latter here: skiprows = cast(int, skiprows). Similarly for the other blocks.
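To make the suggestion concrete, here is a minimal sketch of the cast approach; the surrounding skiprows logic is simplified and not the exact code from this PR:

    from __future__ import annotations

    from typing import cast

    from pandas.api.types import is_integer

    def add_skiprows(header_rows: int, nrows: int, skiprows: int | None) -> int:
        if skiprows is None:
            return header_rows + nrows
        if is_integer(skiprows):
            # mypy does not yet narrow the type from is_integer, so cast explicitly;
            # the alternative is a "# type: ignore[...]" comment on the next line.
            skiprows = cast(int, skiprows)
            return header_rows + nrows + skiprows
        raise TypeError("skiprows must be an integer or None")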

@rhshadrach (Member) left a comment

Thanks for the changes; looks really good. I like the use of is_list_like, but I think we should exclude sets here; can you pass allow_sets=False?
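For context on why sets are worth excluding: is_list_like counts them as list-like by default, but a set has no defined order, which matters for positional arguments such as header and index_col. A quick illustration:

    from pandas.api.types import is_list_like

    print(is_list_like({0, 1}))                    # True: sets are list-like by default
    print(is_list_like({0, 1}, allow_sets=False))  # False: sets are excluded
    print(is_list_like([0, 1], allow_sets=False))  # True: ordered sequences still pass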

Comment on removed lines 123 to 129:
if isinstance(self.header, (list, tuple, np.ndarray)):
    if not all(map(is_integer, self.header)):
        raise ValueError("header must be integer or list of integers")
    if any(i < 0 for i in self.header):
        raise ValueError(
            "cannot specify multi-index header with negative integers"
        )
Member review comment:

I might be missing it; is validate_header_arg called somewhere instead?

@ahawryluk (Contributor, Author) replied:

Yes, but it took me a while to find it. It's called earlier in TextFileReader:

  pandas/io/parsers/readers.py(1847)TextParser()
-> return TextFileReader(*args, **kwds)

  pandas/io/parsers/readers.py(1412)__init__()
-> self.options, self.engine = self._clean_options(options, engine)

  pandas/io/parsers/readers.py(1607)_clean_options()
-> validate_header_arg(options["header"])

Sets for skiprows currently work, so I've left that as is.
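As a usage note on that last point, a hypothetical example of passing a set for skiprows (the file name is a placeholder):

    import pandas as pd

    # Skip the rows at positions 1 and 3; per the comment above, a set is
    # currently accepted here the same way a list would be.
    df = pd.read_excel("example.xlsx", skiprows={1, 3})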
@ahawryluk (Contributor, Author)

@rhshadrach I think this is ready for review. It didn't pass all tests, but the failures were different on the last two runs, so I suspect they aren't caused by this PR. Of course, let me know if I'm wrong about that.

@rhshadrach (Member) left a comment

Just a few more allow_sets=False, otherwise looks good.

  if not (
-     is_sequence
+     is_list_like(self.index_col)
Member review comment: allow_sets=False

    raise ValueError(
        "cannot specify multi-index header with negative integers"
    )
if is_list_like(self.header):
Member review comment: allow_sets=False

@rhshadrach (Member) left a comment

lgtm, @jreback for review

@jreback merged commit 9e10206 into pandas-dev:main on Jun 5, 2022
@jreback (Contributor) commented on Jun 5, 2022

thanks @ahawryluk

Labels: IO Excel (read_excel, to_excel), Performance (Memory or execution speed performance)
Successfully merging this pull request may close this issue: read_excel optimize nrows