Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add function to combine H5 dat files #6298

Open
wants to merge 5 commits into
base: develop
Choose a base branch
from

Conversation

knelli2
Copy link
Contributor

@knelli2 knelli2 commented Sep 19, 2024

Proposed changes

Towards #6246.

This function is distinct from the existing CLI function spectre combine-h5 dat because this takes care of overlapping times in the dat subfiles (we could optionally replace the python function with this one if we bind this function). Also, this is a C++ function and not a python function because it will be used in the ReduceCceWorldtube executable which must be statically linked.

Upgrade instructions

Code review checklist

  • The code is documented and the documentation renders correctly. Run
    make doc to generate the documentation locally into BUILD_DIR/docs/html.
    Then open index.html.
  • The code follows the stylistic and code quality guidelines listed in the
    code review guide.
  • The PR lists upgrade instructions and is labeled bugfix or
    new feature if appropriate.

Further comments

@knelli2 knelli2 mentioned this pull request Sep 19, 2024
7 tasks
@knelli2 knelli2 force-pushed the h5_dat_combine branch 2 times, most recently from 7c0a988 to a7d1cef Compare September 20, 2024 00:09
Copy link
Member

@nilsdeppe nilsdeppe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please read all the comments first. I give a few different options that I think would all resolve my concerns and I don't have a preference for which to do :)

// in the sequence of files since we are looping backward) is before
// any of the times in this file. If so, don't include those times.
std::optional<size_t> row = times.rows() - 1;
while (times(row.value(), 0) >= earliest_time.value()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that we guarantee that our HDF5 Dat output is actually sorted in time. I think you need to first sort the matrix by the first column before combining. I think what you need to do is store the index of the sorted matrix, and then sort again in the next for loop below.

// the number of times and the earliest time of this file
if (not earliest_time.has_value()) {
num_time_map.at(dat_filename)[index] = dimensions[0];
earliest_time = times(0, 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not guarantee to be the earliest time. It's the first time written, but that doesn't have to be the earliest time, I think. I believe our observers do not enforce ordering on write.

@@ -149,4 +157,186 @@ void combine_h5_vol(const std::vector<std::string>& file_names,
new_file.close_current_object();
}
}

void combine_h5_dat(const std::vector<std::string>& h5_files_to_combine,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you assume that the files are in increasing order in time. It would be good to add a check for that since otherwise users may be very surprised. Another option would be to just sort the list in increasing order for the subfile(s).

// Only append data if we include data from this file
if (num_times.has_value()) {
// Always start with row 0
const Matrix data_to_append = input_dat_file.get_data_subset(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will need to load the entire matrix, sort it, and then trim it before write.

Comment on lines 26 to 28
* will be ignored and will not appear in \p output_h5_filename. This function
* also assumes that the times in each of the \p h5_files_to_combine are already
* sorted.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this second assumption is generally not valid for spectre, unfortunately. If you would like to keep it, I think you should add a check that it's true because someone will inevitably run this over a file where it's not true.

Comment on lines 21 to 23
* meaning if you have data in `File1.h5` and `File2.h5` and if the first time
* in `File2.h5` is before some times in `File1.h5`, those times in `File1.h5`
* will be discarded and won't appear in the combined H5 file.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the constraint is quite strong and should be explicit: the files past must be in increasing time order.

if (file_system::check_if_file_exists(filename)) {
file_system::rm(filename, true);
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to add tests for catching unordered files passed in, and for unordered data in the subfiles. I worry that these will both be common mistakes.

This function also handles overlapping times by taking the "latest"
data always.
Now they can return either a Matrix or a vector<vector<double>>
@knelli2
Copy link
Contributor Author

knelli2 commented Sep 27, 2024

I decided to simply assert that the H5 files were monotonically increasing in their earliest time in the files. I think something else should be responsible for actually putting the H5 files in order properly. This also allows for the times in each dat file to be unordered. One downside is that we have to read all data in from a dat file first, sort it, then trim the overlapping times we don't need. But that shouldn't be too bad.

To facilitate this sorting (because a Matrix can't be sorted easily) I just added overloads to the dat.get_data() to return the data as std::vector<std::vector<double>> which can be sorted very easily. These are the two new commits before the fixup

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants