
Dependencies._column_loc: files parameter has a mismatch between typing and implementation #407

Closed
ChristianGeng opened this issue May 13, 2024 · 4 comments · Fixed by #415

Comments


ChristianGeng commented May 13, 2024

Problem

The typing of Dependencies._column_loc does not seem to match the implementation.

  • The input parameter files seems to be implemented for scalar input only, whereas the type hint suggests that list input should be possible.

For list input df.at[files, column] would crash, whereas df.loc[files, column].values.tolist() would return a list.

  • In addition, the type conversion also seems to be implemented for scalar input only.

Possible solutions

  • change the typing to accept only scalar input and keep the implementation
  • change the implementation to allow both and keep the current typing. In that case, scalar input would return a scalar, and list input would return a list whose length corresponds to the input length.

Additional points

  • why not allow tuples in typing?

Example

>>> df.at['file-1.wav', column]
16
>>> df.at[['file-1.wav', 'file-0.wav'], column]
*** pandas.errors.InvalidIndexError: ['file-1.wav', 'file-0.wav']

>>> df.loc[['file-1.wav', 'file-0.wav'], column].values.tolist()
[16, 16]

>>> df.loc[['file-1.wav'], column].values.tolist()
[16]
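The example can be reproduced with a minimal stand-in DataFrame (the index and values below are illustrative, not the real dependency table):

```python
import pandas as pd

# Stand-in for the internal dependency table
df = pd.DataFrame({"bit_depth": [16, 16]}, index=["file-0.wav", "file-1.wav"])
column = "bit_depth"

print(df.at["file-1.wav", column])  # scalar access works: 16

try:
    # .at does not accept list keys
    df.at[["file-1.wav", "file-0.wav"], column]
except pd.errors.InvalidIndexError as error:
    print(f"InvalidIndexError: {error}")

# .loc handles list input and returns a list after conversion
print(df.loc[["file-1.wav", "file-0.wav"], column].values.tolist())  # [16, 16]
```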

Further Background

Benchmarking in benchmark-dependencies-methods.py uses list comprehensions of the form:

   [deps.bit_depth(file) for file in _files]

When evaluating polars this becomes tremendously slow (57 sec.).

However, when I implement this based on a version of Dependencies._column_loc that can process iterable input and call it with deps.bit_depth(_files), this is fast (0.009280681610107422 sec.), which is a factor of more than 6000.
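The difference between the two access patterns can be sketched in isolation with plain pandas (a stand-in DataFrame; absolute timings depend on the machine and pandas version, so only the relative gap matters):

```python
import time

import pandas as pd

# Stand-in dependency table with 10000 files, as in the benchmark
files = [f"file-{i}.wav" for i in range(10_000)]
df = pd.DataFrame({"bit_depth": [16] * len(files)}, index=files)

t0 = time.time()
slow = [df.loc[file, "bit_depth"] for file in files]  # one lookup per file
t_loop = time.time() - t0

t0 = time.time()
fast = df.loc[files, "bit_depth"].values.tolist()  # single vectorized lookup
t_vec = time.time() - t0

print(f"loop: {t_loop:.4f}s  vectorized: {t_vec:.4f}s")
```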

Should this issue maybe be resolved before #385 can be meaningfully added to the benchmarks?

That said, I would of course favor implementing the vectorized version of _column_loc - but I do not know how common the use of the list comprehension actually is.

What would your take on this be @hagenw ?


hagenw commented May 27, 2024

I had implemented this in #370, but was not able to find a real benefit and reverted the changes in #393.
A solution to this problem in line with #393 would be to change the typing to support only a scalar.


ChristianGeng commented May 29, 2024

I had implemented this in #370, but was not able to find a real benefit and reverted the changes in #393. A solution to this problem in line with #393 would be to change the typing to support only a scalar.

In the context of benchmarking polars methods I have found quite large differences, typically a speedup factor of 600 or so.

I fear that the #370 implementation is lost as the commits got squashed, right?

This is a second independent run; most of the time the unvectorized version is in the 2 seconds ballpark again.

| method | pyarrow | string | object |
| --- | --- | --- | --- |
| Dependencies.call() | 0.000 | 0.000 | 0.000 |
| Dependencies.contains(10000 files) | 0.003 | 0.003 | 0.003 |
| Dependencies.get_item(10000 files) | 0.652 | 0.252 | 0.165 |
| Dependencies.len() | 0.000 | 0.000 | 0.000 |
| Dependencies.str() | 0.009 | 0.006 | 0.003 |
| Dependencies.archives | 0.146 | 0.150 | 0.122 |
| Dependencies.attachments | 0.019 | 0.026 | 0.018 |
| Dependencies.attachment_ids | 0.018 | 0.031 | 0.018 |
| Dependencies.files | 0.012 | 0.012 | 0.012 |
| Dependencies.media | 0.069 | 0.115 | 0.106 |
| Dependencies.removed_media | 0.065 | 0.107 | 0.072 |
| Dependencies.table_ids | 0.025 | 0.037 | 0.026 |
| Dependencies.tables | 0.017 | 0.026 | 0.017 |
| Dependencies.archive(10000 files) / vectorized | 0.011 | 0.010 | 0.009 |
| Dependencies.archive(10000 files) | 0.045 | 0.030 | 0.026 |
| Dependencies.bit_depth(10000 files) / vectorized | 0.002 | 0.003 | 0.002 |
| Dependencies.bit_depth(10000 files) | 2.786 | 2.171 | 1.653 |
| Dependencies.channels(10000 files) / vectorized | 0.003 | 0.003 | 0.003 |
| Dependencies.channels(10000 files) | 2.832 | 2.156 | 1.696 |
| Dependencies.checksum(10000 files) / vectorized | 0.003 | 0.003 | 0.003 |
| Dependencies.checksum(10000 files) | 2.468 | 1.975 | 1.467 |
| Dependencies.duration(10000 files) / vectorized | 0.003 | 0.003 | 0.003 |
| Dependencies.duration(10000 files) | 2.772 | 2.106 | 1.629 |
| Dependencies.format(10000 files) / vectorized | 0.003 | 0.003 | 0.003 |
| Dependencies.format(10000 files) | 2.519 | 1.979 | 1.457 |
| Dependencies.removed(10000 files) / vectorized | 0.003 | 0.003 | 0.003 |
| Dependencies.removed(10000 files) | 2.788 | 2.145 | 1.652 |
| Dependencies.sampling_rate(10000 files) / vectorized | 0.003 | 0.003 | 0.003 |
| Dependencies.sampling_rate(10000 files) | 2.824 | 2.165 | 1.707 |
| Dependencies.type(10000 files) / vectorized | 0.003 | 0.003 | 0.003 |
| Dependencies.type(10000 files) | 2.873 | 2.066 | 1.660 |
| Dependencies.version(10000 files) / vectorized | 0.003 | 0.003 | 0.003 |
| Dependencies.version(10000 files) | 2.605 | 1.900 | 1.519 |
| Dependencies._add_attachment() | 0.194 | 0.063 | 0.089 |
| Dependencies._add_media(10000 files) | 0.076 | 0.061 | 0.063 |
| Dependencies._add_meta() | 0.140 | 0.188 | 0.182 |
| Dependencies._drop() | 0.125 | 0.079 | 0.081 |
| Dependencies._remove() | 0.069 | 0.073 | 0.067 |
| Dependencies._update_media() | 0.153 | 0.084 | 0.088 |
| Dependencies._update_media_version(10000 files) | 0.029 | 0.010 | 0.010 |


hagenw commented May 29, 2024

I fear that the #370 implementation is lost as the commits got squashed, right?

I could check out the branch locally, and pushed it again. So, I think you should be able to fetch it now as well. Otherwise, the changes are visible at https://github.com/audeering/audb/pull/370/files.

This is a second independent run

The unvectorized results are for the implementation presented here?

For comparison, results from #370

[image: benchmark results from #370]

[image: further benchmark results from #370]

results from you

[image: ChristianGeng's benchmark results]

results from current main

[image: benchmark results from current main]

The pyarrow column is the important one. There, your vectorized results are the fastest.

Also my implementation from #370 shows faster results for the vectorized solution. The main reason I decided against #370 was that it does not translate into any real-world advantage when loading a database, as briefly mentioned in #375 (comment),
and that it complicates the API, as the return value would then depend on the input value, which is not so nice.


ChristianGeng commented May 29, 2024

I have now tried out your version that uses .at for single files.
I can verify that .at is much faster than .loc for single files. The problem is then not the list comprehension but the .loc in the implementation.

In this case I am in line with you and would do away with the type hinting for sequences (and allow only single-file input).
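The .at versus .loc gap for single-file access can be checked with a quick stand-in benchmark (illustrative table; absolute timings vary with machine and pandas version):

```python
import time

import pandas as pd

# Stand-in dependency table with 10000 files
files = [f"file-{i}.wav" for i in range(10_000)]
df = pd.DataFrame({"bit_depth": [16] * len(files)}, index=files)

t0 = time.time()
for file in files:
    df.at[file, "bit_depth"]  # optimized scalar accessor
t_at = time.time() - t0

t0 = time.time()
for file in files:
    df.loc[file, "bit_depth"]  # general label-based accessor
t_loc = time.time() - t0

print(f".at: {t_at:.4f}s  .loc: {t_loc:.4f}s")
```

On a typical run, .at comes out clearly ahead, which is consistent with the observation above that the list comprehension itself is not the bottleneck.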
