Reading H5DF files is broken #589

Closed
nanounanue opened this issue Feb 4, 2019 · 2 comments

Writing matrices to disk completed successfully, but the run fails when triage tries to read them back:

INFO:root:Starting train/test for 1 out of 15: train range: 2015-02-01 00:00:00 to 2015-12-01 00:00:00
INFO:root:Generating train/test tasks for split 442c7fb8e1efac9afd2a3c2ad2b5a087
ERROR:root:Run interrupted by uncaught exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/triage/experiments/base.py", line 678, in run
    self._run_profile()
  File "/usr/local/lib/python3.6/site-packages/triage/experiments/base.py", line 663, in _run_profile
    cp.runcall(self._run)
  File "/usr/local/lib/python3.6/cProfile.py", line 109, in runcall
    return func(*args, **kw)
  File "/usr/local/lib/python3.6/site-packages/triage/experiments/base.py", line 626, in _run
    self.train_and_test_models()
  File "/usr/local/lib/python3.6/site-packages/triage/experiments/base.py", line 602, in train_and_test_models
    tasks = self._all_train_test_tasks()
  File "/usr/local/lib/python3.6/site-packages/triage/experiments/base.py", line 596, in _all_train_test_tasks
    model_comment=self.config.get('model_comment', None)
  File "/usr/local/lib/python3.6/site-packages/triage/component/catwalk/__init__.py", line 29, in generate_tasks
    if train_store.empty:
  File "/usr/local/lib/python3.6/site-packages/triage/component/catwalk/storage.py", line 388, in empty
    head_of_matrix = self.head_of_matrix
  File "/usr/local/lib/python3.6/site-packages/triage/component/catwalk/storage.py", line 523, in head_of_matrix
    head_of_matrix = pd.read_hdf(self.matrix_base_store.path, start=0, stop=1)
  File "/usr/local/lib/python3.6/site-packages/pandas/io/pytables.py", line 394, in read_hdf
    return store.select(key, auto_close=auto_close, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/pandas/io/pytables.py", line 741, in select
    return it.get_result()
  File "/usr/local/lib/python3.6/site-packages/pandas/io/pytables.py", line 1483, in get_result
    results = self.func(self.start, self.stop, where)
  File "/usr/local/lib/python3.6/site-packages/pandas/io/pytables.py", line 734, in func
    columns=columns)
  File "/usr/local/lib/python3.6/site-packages/pandas/io/pytables.py", line 2928, in read
    ax = self.read_index('axis%d' % i, start=_start, stop=_stop)
  File "/usr/local/lib/python3.6/site-packages/pandas/io/pytables.py", line 2517, in read_index
    return self.read_multi_index(key, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/pandas/io/pytables.py", line 2618, in read_multi_index
    verify_integrity=True)
  File "/usr/local/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 242, in __new__
    result._verify_integrity()
  File "/usr/local/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 280, in _verify_integrity
    len(level)))
ValueError: On level 1, label max (7) >= length of level  (1). NOTE: this index is in an inconsistent state
[ValueError] On level 1, label max (7) >= length of level  (1). NOTE: this index is in an inconsistent state

The libraries installed are:

root@c8ba6947de59:/triage# pip list
Package                  Version
------------------------ -------
alembic                  1.0.5  
argcmdr                  0.6.0  
argcomplete              1.9.4  
backcall                 0.1.0  
boto3                    1.9.71 
botocore                 1.12.86
Click                    7.0    
cycler                   0.10.0 
decorator                4.3.2  
Dickens                  1.0.1  
docutils                 0.14   
inflection               0.3.1  
ipython                  7.2.0  
ipython-genutils         0.2.0  
jedi                     0.13.2 
jmespath                 0.9.3  
Mako                     1.0.7  
MarkupSafe               1.1.0  
matplotlib               2.1.2  
numexpr                  2.6.9  
numpy                    1.16.1 
pandas                   0.23.4 
parso                    0.3.2  
Pebble                   4.3.9  
pexpect                  4.6.0  
pickleshare              0.7.5  
pip                      10.0.1 
plumbum                  1.6.4  
prompt-toolkit           2.0.8  
psycopg2                 2.7.7  
psycopg2-binary          2.7.6.1
ptyprocess               0.6.0  
Pygments                 2.3.1  
pyparsing                2.3.1  
python-dateutil          2.7.5  
python-dotenv            0.10.1 
python-editor            1.0.4  
pytz                     2018.9 
PyYAML                   4.2b4  
retrying                 1.3.3  
s3fs                     0.2.0  
s3transfer               0.1.13 
scikit-learn             0.20.2 
scipy                    1.2.0  
seaborn                  0.9.0  
setuptools               39.2.0 
signalled-timeout        1.0.0  
six                      1.12.0 
SQLAlchemy               1.2.15 
sqlalchemy-postgres-copy 0.5.0  
sqlparse                 0.2.4  
tables                   3.3.0  
traitlets                4.3.2  
triage                   3.2.1  
urllib3                  1.24.1 
wcwidth                  0.1.7  
wheel                    0.31.1 
wrapt                    1.10.11

thcrock commented Feb 4, 2019

I can reproduce this in DirtyDuck. Calling .matrix on the MatrixStore actually works, but something about head_of_matrix doesn't.

from triage.component.catwalk.storage import HDFMatrixStore, ProjectStorage
ps = ProjectStorage(..project path...)
ms = HDFMatrixStore(ps, ['matrices'], '3343ebf255af6dbb5204a60a4390c7e1')
ms.head_of_matrix  # a property, per the traceback above; accessing it triggers the read

gives ValueError: On level 1, label max (4) >= length of level (1). NOTE: this index is in an inconsistent state
however,

ms.matrix

outputs the matrix.

So maybe something about the shape of the index is throwing off the code that's trying to read the header, which is

head_of_matrix = pd.read_hdf(self.matrix_base_store.path, start=0, stop=1)


thcrock commented Feb 4, 2019

Honestly, start=0 and stop=0 seems to work

head_of_matrix = pd.read_hdf(ms.matrix_base_store.path, start=0, stop=0)

returns:

Columns: [severe_violations_entity_id_1year_failed::int_avg, severe_violations_entity_id_1year_failed::int_avg_imp, severe_violations_entity_id_1year_failed::int_max, severe_violations_entity_id_1year_failed::int_max_imp, severe_violations_entity_id_1year_failed::int_sum, severe_violations_entity_id_1year_failed::int_sum_imp, severe_violations_entity_id_3month_failed::int_avg, severe_violations_entity_id_3month_failed::int_avg_imp, severe_violations_entity_id_3month_failed::int_max, severe_violations_entity_id_3month_failed::int_max_imp, severe_violations_entity_id_3month_failed::int_sum, severe_violations_entity_id_3month_failed::int_sum_imp, severe_violations_entity_id_3year_failed::int_avg, severe_violations_entity_id_3year_failed::int_avg_imp, severe_violations_entity_id_3year_failed::int_max, severe_violations_entity_id_3year_failed::int_max_imp, severe_violations_entity_id_3year_failed::int_sum, severe_violations_entity_id_3year_failed::int_sum_imp, severe_violations_entity_id_6month_failed::int_avg, severe_violations_entity_id_6month_failed::int_avg_imp, severe_violations_entity_id_6month_failed::int_max, severe_violations_entity_id_6month_failed::int_max_imp, severe_violations_entity_id_6month_failed::int_sum, severe_violations_entity_id_6month_failed::int_sum_imp, failed_inspection]
Index: []
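
One plausible reading of why stop=1 fails while stop=0 works: the traceback ends in read_multi_index, and if the same start/stop slice is applied both to each level's values and to each level's labels, the first row's label can point past the end of the truncated level. The sketch below is plain Python, not pandas code. The function names, the two-level entity_id / as_of_date layout, and the sample values are all hypothetical; it only mimics the kind of check that _verify_integrity reports on.

```python
def verify_integrity(levels, codes):
    """Raise ValueError if any code points past the end of its level."""
    for i, (level, level_codes) in enumerate(zip(levels, codes)):
        if level_codes and max(level_codes) >= len(level):
            raise ValueError(
                f"On level {i}, label max ({max(level_codes)}) "
                f">= length of level ({len(level)})"
            )


def slice_index(levels, codes, start, stop):
    """Naively apply the same row slice to both the levels and the codes."""
    return (
        [level[start:stop] for level in levels],
        [level_codes[start:stop] for level_codes in codes],
    )


# Hypothetical two-level index: entity_id x as_of_date.
levels = [[101, 102, 103], ["2015-01-01", "2015-02-01", "2015-03-01"]]
codes = [[0, 1, 2], [2, 2, 2]]  # every row uses the third date

verify_integrity(levels, codes)  # the full index is consistent

# Analogue of start=0, stop=1: the date level shrinks to one value,
# but the first row still carries code 2.
one_row = slice_index(levels, codes, 0, 1)
try:
    verify_integrity(*one_row)
    one_row_error = None
except ValueError as exc:
    one_row_error = str(exc)

# Analogue of start=0, stop=0: no codes remain, so there is nothing to violate.
zero_rows = slice_index(levels, codes, 0, 0)
verify_integrity(*zero_rows)

print(one_row_error)
```

If that reading is right, a zero-row read is safe only because there are no labels left to check, which would also explain why the full .matrix read succeeds.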

thcrock self-assigned this Feb 4, 2019
thcrock added a commit that referenced this issue Feb 5, 2019
Pandas' read_hdf does not implement a method for only loading the first
n rows that works reliably on different datasets, particularly with multi-indices.
This is problematic because head_of_matrix is how the MatrixStore
implements empty() and columns() methods.

Instead of implementing head_of_matrix, we implement
empty() and columns() on HDFStore directly, obviating the need for a
head_of_matrix method that only reads the first row.