Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix empty/columns check on HDFStore [Resolves #589] #592

Merged
merged 1 commit into from
Feb 6, 2019

Conversation

thcrock
Copy link
Contributor

@thcrock thcrock commented Feb 5, 2019

Pandas' read_hdf does not implement a method for only loading the first
n rows that works reliably on different datasets, particularly with multi-indices.
This is problematic because head_of_matrix is how the MatrixStore
implements empty() and columns() methods.

Instead of implementing head_of_matrix, we implement
empty() and columns() on HDFStore directly, obviating the need for a
head_of_matrix method that only reads the first row.

Pandas' read_hdf does not implement a method for only loading the first
n rows that works reliably on different datasets, particularly with multi-indices.
This is problematic because head_of_matrix is how the MatrixStore
implements empty() and columns() methods.

Instead of implementing head_of_matrix, we implement
empty() and columns() on HDFStore directly, obviating the need for a
head_of_matrix method that only reads the first row.
Copy link
Contributor

@nanounanue nanounanue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, Will we build the HDF5's S3 storage using this new class?

except ValueError:
# There is no known way to make the start/stop operations work all the time
# , there is often a ValueError when trying to load just the first row
# However, if we do get a ValueError that means there is data so it can't be empty
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@codecov-io
Copy link

codecov-io commented Feb 5, 2019

Codecov Report

Merging #592 into master will increase coverage by <.01%.
The diff coverage is 71.42%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #592      +/-   ##
==========================================
+ Coverage   82.49%   82.49%   +<.01%     
==========================================
  Files          83       83              
  Lines        4845     4851       +6     
==========================================
+ Hits         3997     4002       +5     
- Misses        848      849       +1
Impacted Files Coverage Δ
src/triage/component/catwalk/storage.py 92.18% <71.42%> (-0.22%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 18726a0...fef2f4e. Read the comment docs.

@thcrock
Copy link
Contributor Author

thcrock commented Feb 5, 2019

As for whether we implement HDF-on-S3 here or elsewhere, I'm not totally sure, but implementing it here kind of makes sense I think.

One alternative, a new storage type (e.g. disk, s3, disk-backed s3) would I think have cleaner code, but a worse interface and performance. Making people choose a new storage type to handle this is kind of smelly, and since experiments use the same project path for both matrices and models, it would make all models go through a disk intermediary which would probably be bad performance-wise

@nanounanue nanounanue merged commit cd22069 into master Feb 6, 2019
@thcrock thcrock deleted the hdf_shapefix branch February 12, 2019 21:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants