Fix empty/columns check on HDFStore [Resolves #589] #592

thcrock · 2019-02-05T17:08:21Z

Pandas' read_hdf offers no way to load only the first n rows that works
reliably across datasets, particularly those with multi-indices.
This is a problem because MatrixStore implements its empty() and
columns() methods on top of head_of_matrix.

Rather than implementing head_of_matrix for HDF, we implement
empty() and columns() on HDFStore directly, so a method that reads
only the first row is no longer needed.
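
As a rough illustration of the idea described above (not the code merged in this PR), the sketch below treats a `ValueError` from a partial read as evidence that data exists. The class name `HDFStoreSketch`, the injected `read_rows` callable, and the full-read fallback in `columns()` are all assumptions made for illustration. In real use, `read_rows` would presumably wrap something like `pandas.read_hdf(path, start=start, stop=stop)`.

```python
from types import SimpleNamespace
from typing import Any, Callable, List, Optional


class HDFStoreSketch:
    """Hypothetical stand-in for the HDF-backed matrix store discussed here."""

    def __init__(
        self,
        read_rows: Callable[[Optional[int], Optional[int]], Any],
        label_column: str,
    ):
        # read_rows(start, stop) is assumed to wrap a call such as
        # pandas.read_hdf(path, start=start, stop=stop) and return a DataFrame.
        self.read_rows = read_rows
        self.label_column = label_column

    def empty(self) -> bool:
        try:
            return bool(self.read_rows(0, 1).empty)
        except ValueError:
            # The partial read failed; treated as evidence that the file holds data.
            return False

    def columns(self, include_label: bool = False) -> List[str]:
        try:
            head = self.read_rows(0, 1)
        except ValueError:
            # Assumption for this sketch: fall back to a full read.
            head = self.read_rows(None, None)
        return [c for c in head.columns if include_label or c != self.label_column]


def flaky_reader(start, stop):
    # Fake reader mimicking the failure mode: any partial read raises ValueError.
    if stop is not None:
        raise ValueError("partial read not supported for this dataset")
    return SimpleNamespace(empty=False, columns=["feature_a", "feature_b", "outcome"])


store = HDFStoreSketch(flaky_reader, label_column="outcome")
print(store.empty())
print(store.columns())
```

Injecting the reader keeps the sketch runnable without an HDF file on disk. The fake reader only needs to expose the `empty` and `columns` attributes that a DataFrame would.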


nanounanue

So, will we build the HDF5 S3 storage using this new class?

nanounanue · 2019-02-05T17:10:36Z

src/triage/component/catwalk/storage.py

+            except ValueError:
+                # There is no known way to make the start/stop operations work all the time
+                # , there is often a ValueError when trying to load just the first row
+                # However, if we do get a ValueError that means there is data so it can't be empty


codecov-io · 2019-02-05T17:20:28Z

Codecov Report

Merging #592 into master will increase coverage by <.01%.
The diff coverage is 71.42%.

@@            Coverage Diff             @@
##           master     #592      +/-   ##
==========================================
+ Coverage   82.49%   82.49%   +<.01%     
==========================================
  Files          83       83              
  Lines        4845     4851       +6     
==========================================
+ Hits         3997     4002       +5     
- Misses        848      849       +1

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/triage/component/catwalk/storage.py | `92.18% <71.42%> (-0.22%)` | ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 18726a0...fef2f4e.

thcrock · 2019-02-05T18:23:13Z

As for whether we implement HDF-on-S3 here or elsewhere, I'm not totally sure, but implementing it here kind of makes sense I think.

One alternative, a new storage type (e.g. disk, s3, disk-backed s3), would I think have cleaner code but a worse interface and worse performance. Making people choose a new storage type to handle this is kind of smelly, and since experiments use the same project path for both matrices and models, all models would go through a disk intermediary, which would probably hurt performance.

thcrock assigned nanounanue Feb 5, 2019

nanounanue approved these changes Feb 5, 2019

nanounanue merged commit cd22069 into master Feb 6, 2019

thcrock deleted the hdf_shapefix branch February 12, 2019 21:22
