Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

link FITS_rec instances to hdu extensions on save #178

Merged
merged 1 commit into from
Jul 10, 2023

Conversation

braingram
Copy link
Collaborator

@braingram braingram commented Jul 7, 2023

DataModel converts recarrays to FITS_rec on assignment these were not correctly linked to FITS extensions on save resulting in duplication of FITS_rec data on save.

This adds a test for the duplication and some book keeping to allow linking the converted FITS_recs and extensions on save.

Astropy creates a 'view' of the FITS_rec when assigned to a new FITS extension which makes a simple 'is' check complicated (so the strategy used for ndarray and NDArrayType could not be used). Instead, the links between FITS_rec and extensions are tracked (via the FitsContext) which allows creation of the links without having re-find the hdu created from the FITS_rec. (The existing strategy is left in place for ndarray to maintain compatibility with existing ASDF-in-FITS behavior which does not require the same schema structure as the DataModels).

Resolves JP-3277

Regression tests running: https://plwishmaster.stsci.edu:8081/blue/organizations/jenkins/RT%2FJWST-Developers-Pull-Requests/detail/JWST-Developers-Pull-Requests/776/pipeline/202/

Docs failures are unrelated (see: #180)

Checklist

  • added entry in CHANGES.rst (either in Bug Fixes or Changes to API)
  • updated relevant tests
  • updated relevant documentation
  • updated relevant milestone(s)
  • added relevant label(s)

@braingram
Copy link
Collaborator Author

braingram commented Jul 7, 2023

I'm seeing a few failures in the regression tests:
https://plwishmaster.stsci.edu:8081/blue/organizations/jenkins/RT%2FJWST-Developers-Pull-Requests/detail/JWST-Developers-Pull-Requests/776/tests

[stable-deps] test_nircam_tsgrism_stage2[calints] – jwst.regtest.test_nircam_tsgrism1m 14s
[stable-deps] test_nircam_tsgrism_stage2[extract_2d] – jwst.regtest.test_nircam_tsgrism1s
[stable-deps] test_nircam_tsgrism_stage2[flat_field] – jwst.regtest.test_nircam_tsgrism2s
[stable-deps] test_nircam_tsgrism_stage2[o012_crfints] – jwst.regtest.test_nircam_tsgrism<1s
[stable-deps] test_nircam_tsgrism_stage2[srctype] – jwst.regtest.test_nircam_tsgrism1s
[stable-deps] test_nircam_tsgrism_stage2[x1dints] – jwst.regtest.test_nircam_tsgrism<1s
[stable-deps] test_nircam_tsgrism_stage3_x1dints – jwst.regtest.test_nircam_tsgrism

@hbushouse @tapastro are these expected failures?

An example error is:

     Headers contain differences:
       Headers have different number of cards:
        a: 228
        b: 234
       Extra keyword 'R_DFLAT' in b: 'N/A'
       Extra keyword 'R_FFLAT' in b: 'N/A'
       Extra keyword 'R_SFLAT' in b: 'N/A'
       Inconsistent duplicates of keyword ''      :
        Occurs 41 time(s) in a, 44 times in (b)
       Keyword         [24] has different values:
          a> Nirspec FORE Model reference file information
           ? ^^^^^^^^ --------
          b> DFlat reference file information
           ? ^  ++
       Keyword         [25] has different values:
          a> Nirspec FPA Model reference file information
           ? -------- ^^^^^^^
          b> FFlat reference file information
           ?  ^ ++
       Keyword         [26] has different values:
          a> Gain reference file information
           ? ^ ^^
          b> SFlat reference file information
           ? ^^^ ^
       Keyword         [27] has different values:
          a> IFU fore reference file information
           ? ^ ^^^ ^
          b> Nirspec FORE Model reference file information
           ? ^^^^^^^^ ^^^^^ ^ +
       Keyword         [28] has different values:
          a> IFU post reference file information
          b> Nirspec FPA Model reference file information
       Keyword         [29] has different values:
          a> IFU slicer reference file information
           ? ^^^^^^ ^^^
          b> Gain reference file information
           ? ^^ ^
       Keyword         [30] has different values:
          a> Linearity reference file information
           ? ^^^ -----
          b> IFU fore reference file information
           ? ^^^^^^^
       Keyword         [31] has different values:
          a> Mask reference file information
           ? ^^ ^
          b> IFU post reference file information
           ? ^^^^^^ ^
       Keyword         [32] has different values:
          a> Nirspec MSA Model reference file information
          b> IFU slicer reference file information
       Keyword         [33] has different values:
          a> Nirspec OTE Model reference file information
          b> Linearity reference file information
       Keyword         [34] has different values:
          a> Photometric reference file information
           ? ^^^^^^^^^^^
          b> Mask reference file information
           ? ^^^^
       Keyword         [35] has different values:
          a> Read noise reference file information
          b> Nirspec MSA Model reference file information
       Keyword         [36] has different values:
          a> Regions reference file information
          b> Nirspec OTE Model reference file information
       Keyword         [37] has different values:
          a> Saturation reference file information
           ? ^^ ^ -- ^^
          b> Photometric reference file information
           ? ^^^ ^^^^  ^
       Keyword         [38] has different values:
          a> Spectral distortion reference file information
           ? ^^ --- --   ^^^^^^^
          b> Read noise reference file information
           ? ^   +++  ^
       Keyword         [39] has different values:
          a> Superbias reference file information
           ? ^^^ ^^ ^
          b> Regions reference file information
           ? ^ ^ ^^
       Keyword         [40] has different values:
          a> Wavelength Range reference file information
          b> Saturation reference file information
       Keyword         [41] has different values:
          a> Calibration step information
          b> Spectral distortion reference file information

@braingram braingram marked this pull request as ready for review July 7, 2023 20:54
@braingram braingram requested a review from a team as a code owner July 7, 2023 20:54
@hbushouse
Copy link
Collaborator

The regtest errors are expected only in the sense that they've been happening for a while now, but we can't figure out why. But regardless, they have nothing to do with this PR, so you're good.

@braingram
Copy link
Collaborator Author

I think this should address JP-3277 but it probably makes sense to test this PR on one of the failures mentioned on the ticket. However, I don't know how to do that :) I'm happy to learn (if that makes sense) but would need to know how to find the data and run the failing step.

@tapastro
Copy link
Collaborator

tapastro commented Jul 7, 2023

I can give it a try. I'll install this PR version of stdatamodels and test it out. But if you want to try, feel free to install jwst in your environment, and you may be able to follow the repo Wiki for regression testing locally and use the tso regression test data to verify?

@codecov
Copy link

codecov bot commented Jul 8, 2023

Codecov Report

Patch coverage: 94.44% and project coverage change: +0.04 🎉

Comparison is base (065ec6e) 63.90% compared to head (9d30b03) 63.95%.

❗ Current head 9d30b03 differs from pull request most recent head eb516d5. Consider uploading reports for the commit eb516d5 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #178      +/-   ##
==========================================
+ Coverage   63.90%   63.95%   +0.04%     
==========================================
  Files         101      101              
  Lines        5555     5565      +10     
==========================================
+ Hits         3550     3559       +9     
- Misses       2005     2006       +1     
Impacted Files Coverage Δ
src/stdatamodels/fits_support.py 90.27% <94.44%> (-0.01%) ⬇️

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@braingram
Copy link
Collaborator Author

I ran the test_niriss_soss_stage3_crfints regtest locally (this isn't a perfect fit for the runs mentioned in the ticket which appear to use NIRSPEC data but this test does run the 'calwebb_tso3' pipeline).

Looking at the data during test with and without the changes in this PR...

Without this PR, jw01091002001_03101_00001-seg001_nis_short_x1dints.fits, contains duplicated data (30 EXTRACT1D and 1 INT_TIMES extensions that are not referenced from the ASDF extension which contains 31 embedded binary blocks).

With this PR the same files contains the same EXTRACT1D and INT_TIMES extensions however these are linked in the ASDF extension (which contains no binary blocks).

DataModel converts recarrays to FITS_rec on assignment
these were not correctly linked to FITS extensions on
save resulting in duplication of FITS_rec data on save.

This adds a test for the duplication and some book keeping
to allow linking the converted FITS_recs and extensions
on save.

Astropy creates a 'view' of the FITS_rec when assigned
to a new FITS extension which makes a simple 'is' check
complicated (so the strategy used for ndarray and NDArrayType
could not be used). Instead, the links between FITS_rec
and extensions are tracked (via the FitsContext) which allows
creation of the links without having re-find the hdu
created from the FITS_rec. (The existing strategy is left
in place for ndarray to maintain compatibility with existing
ASDF-in-FITS behavior which does not require the same schema
structure as the DataModels).
Copy link
Collaborator

@hbushouse hbushouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@braingram braingram merged commit 9281ec1 into spacetelescope:master Jul 10, 2023
15 checks passed
@braingram braingram deleted the fits_rec_duplication branch July 10, 2023 15:56
@hbushouse
Copy link
Collaborator

@tapastro Have you had a chance to do independent testing of this fix?

@tapastro
Copy link
Collaborator

I ran a single spec2 association which used to produce a 440MB x1dints file, where the asdf extension was ~223 MB. This code produced a 250MB file with a 32MB asdf extension. Seems like it did a good job! I don't understand how to check if any of the arrays are linked or reproduced in the extension, though.

@braingram
Copy link
Collaborator Author

Thanks for checking! That definitely looks promising.

There's no 'official' way to check the links. What I've been doing is similar to what is in the test added with this PR:

# open the model and confirm that the table was linked to an hdu
with fits.open(file_path) as ff:
# read the bytes for the embedded ASDF content
asdf_bytes = ff['ASDF'].data.tobytes()
# get only the bytes for the tree (not blocks) by splitting
# on the yaml end document marker '...'
# on the first block magic sequence
tree_string = asdf_bytes.split(b'...')[0].decode('ascii')
unlinked_arrays = re.findall(r'source:\s+[^f]', tree_string)
assert not len(unlinked_arrays), unlinked_arrays

In brief, when stdatamodels links an item in the ASDF tree to a fits extension it changes the data 'source' stored in the ASDF tree to a string like fits:EXTRACT1D,1. This differs from a non-linked array/table which will have a 'source' that is a number corresponding to the index of the ASDF binary block (in this case embedded in the ASDF extension).

So a normal array will appear something like this in the ASDF yaml.

lookup_table: !core/ndarray-1.0.0
  source: 0
  datatype: float32
  byteorder: big
  shape: [2048, 2048]

Whereas a linked array will appear something like:

dq: !core/ndarray-1.0.0
  source: fits:DQ,1
  datatype: uint32
  byteorder: big
  shape: [10, 256, 2048]

(note that the above 2 examples were both taken from the same '*_nis_short_calints.fits' file so a file can have both linked and non-linked arrays/tables, in this case the non-linked arrays are related to the wcs).

Let me know if more details would be helpful.

@tapastro
Copy link
Collaborator

Ah, in that case it's not quite going right - I see 6618 unlinked arrays, and doing a findall on "source:" shows there are 9928 total arrays in the file. I can move this to somewhere public for you to take a look, if you want.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants