Add function to combine H5 dat files #6298

knelli2 · 2024-09-19T23:38:00Z

Proposed changes

Towards #6246.

This function is distinct from the existing CLI command spectre combine-h5 dat because it handles overlapping times in the dat subfiles. (We could optionally replace the Python function with this one if we add Python bindings for it.) It is also written in C++ rather than Python because it will be used in the ReduceCceWorldtube executable, which must be statically linked.

Upgrade instructions

Code review checklist

- The code is documented and the documentation renders correctly. Run make doc to generate the documentation locally into BUILD_DIR/docs/html. Then open index.html.
- The code follows the stylistic and code quality guidelines listed in the code review guide.
- The PR lists upgrade instructions and is labeled bugfix or new feature if appropriate.

Further comments

nilsdeppe

Please read all the comments first. I give a few different options that I think would all resolve my concerns and I don't have a preference for which to do :)

nilsdeppe · 2024-09-23T18:09:41Z

src/IO/H5/CombineH5.cpp

+      // in the sequence of files since we are looping backward) is before
+      // any of the times in this file. If so, don't include those times.
+      std::optional<size_t> row = times.rows() - 1;
+      while (times(row.value(), 0) >= earliest_time.value()) {


I'm not sure that we guarantee that our HDF5 Dat output is actually sorted in time. I think you need to first sort the matrix by the first column before combining. I think what you need to do is store the index of the sorted matrix, and then sort again in the next for loop below.
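
A minimal sketch of the index-based idea, assuming the time column has already been copied out of the dat file into a std::vector<double>. The name sorted_row_indices is hypothetical and is not part of the SpECTRE API. It computes the permutation that orders the rows by time, which could be stored and reused when the data rows are read in the later loop.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not a SpECTRE function): returns row indices ordered
// so that times[result[0]] <= times[result[1]] <= ...
std::vector<std::size_t> sorted_row_indices(const std::vector<double>& times) {
  std::vector<std::size_t> indices(times.size());
  // Start from the identity permutation 0, 1, 2, ...
  std::iota(indices.begin(), indices.end(), std::size_t{0});
  // stable_sort keeps rows with identical times in the order they were written
  std::stable_sort(indices.begin(), indices.end(),
                   [&times](const std::size_t a, const std::size_t b) {
                     return times[a] < times[b];
                   });
  return indices;
}
```

Storing the permutation instead of a sorted copy means the times only need to be inspected once, and the same ordering can be applied to the full rows later.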

nilsdeppe · 2024-09-23T18:12:35Z

src/IO/H5/CombineH5.cpp

+      // the number of times and the earliest time of this file
+      if (not earliest_time.has_value()) {
+        num_time_map.at(dat_filename)[index] = dimensions[0];
+        earliest_time = times(0, 0);


This is not guaranteed to be the earliest time. It's the first time written, but that doesn't have to be the earliest time, I think. I believe our observers do not enforce ordering on write.
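
One way to avoid relying on write order is to take the minimum over the whole time column. The sketch below assumes the column is available as a std::vector<double>; earliest_time_in is a hypothetical name used only for illustration.

```cpp
#include <algorithm>
#include <optional>
#include <vector>

// Hypothetical helper: smallest value in the time column, or std::nullopt if
// the column is empty. Makes no assumption about the order rows were written.
std::optional<double> earliest_time_in(const std::vector<double>& times) {
  if (times.empty()) {
    return std::nullopt;
  }
  return *std::min_element(times.begin(), times.end());
}
```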

nilsdeppe · 2024-09-23T18:14:20Z

src/IO/H5/CombineH5.cpp

@@ -149,4 +157,186 @@ void combine_h5_vol(const std::vector<std::string>& file_names,
    new_file.close_current_object();
  }
 }
+
+void combine_h5_dat(const std::vector<std::string>& h5_files_to_combine,


I think you assume that the files are in increasing order in time. It would be good to add a check for that since otherwise users may be very surprised. Another option would be to just sort the list in increasing order for the subfile(s).
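
A sketch of what such a check could look like, assuming the earliest time of each file has already been collected in the order the files were passed. The helper name is hypothetical, and requiring strictly increasing times (rather than allowing ties) is an assumption of this example.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Hypothetical helper: true if every file starts strictly later than the one
// before it. An empty or single-element list is trivially ordered.
bool files_in_increasing_time_order(
    const std::vector<double>& earliest_time_per_file) {
  // adjacent_find returns the first neighboring pair with previous >= next,
  // so reaching end() means no pair is out of order.
  return std::adjacent_find(earliest_time_per_file.begin(),
                            earliest_time_per_file.end(),
                            std::greater_equal<>{}) ==
         earliest_time_per_file.end();
}
```

If the check fails, the caller could either raise an error or sort the file list by earliest time before combining.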

nilsdeppe · 2024-09-23T18:16:09Z

src/IO/H5/CombineH5.cpp

+      // Only append data if we include data from this file
+      if (num_times.has_value()) {
+        // Always start with row 0
+        const Matrix data_to_append = input_dat_file.get_data_subset(


This will need to load the entire matrix, sort it, and then trim it before write.

nilsdeppe · 2024-09-23T18:17:18Z

src/IO/H5/CombineH5.hpp

+ * will be ignored and will not appear in \p output_h5_filename. This function
+ * also assumes that the times in each of the \p h5_files_to_combine are already
+ * sorted.


I think this second assumption is generally not valid for spectre, unfortunately. If you would like to keep it, I think you should add a check that it's true because someone will inevitably run this over a file where it's not true.

nilsdeppe · 2024-09-23T18:17:57Z

src/IO/H5/CombineH5.hpp

+ * meaning if you have data in `File1.h5` and `File2.h5` and if the first time
+ * in `File2.h5` is before some times in `File1.h5`, those times in `File1.h5`
+ * will be discarded and won't appear in the combined H5 file.


I think the constraint is quite strong and should be explicit: the files passed must be in increasing time order.

nilsdeppe · 2024-09-23T18:20:43Z

tests/Unit/IO/H5/Test_CombineH5.cpp

+    if (file_system::check_if_file_exists(filename)) {
+      file_system::rm(filename, true);
+    }
+  }


It would be good to add tests for catching unordered files passed in, and for unordered data in the subfiles. I worry that these will both be common mistakes.

knelli2 · 2024-09-27T23:49:14Z

I decided to simply assert that the H5 files are monotonically increasing in their earliest times. I think something else should be responsible for actually putting the H5 files in the proper order. This also allows the times in each dat file to be unordered. One downside is that we have to read all the data from a dat file first, sort it, and then trim the overlapping times we don't need, but that shouldn't be too bad.

To facilitate this sorting (because a Matrix can't be sorted easily), I added overloads of dat.get_data() that return the data as std::vector<std::vector<double>>, which can be sorted very easily. These are the two new commits before the fixup.
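
For illustration, the read, sort, and trim steps could look roughly like the sketch below. It is not the code in this PR: sort_and_trim and the Rows alias are hypothetical names, and it assumes every row is non-empty with the time in column 0.

```cpp
#include <algorithm>
#include <optional>
#include <vector>

using Rows = std::vector<std::vector<double>>;

// Hypothetical helper: sorts rows by column 0 (time), then drops every row
// whose time is >= cutoff. The cutoff plays the role of the earliest time of
// the next file, so the later file's data wins in any overlap.
Rows sort_and_trim(Rows rows, const std::optional<double>& cutoff) {
  std::stable_sort(
      rows.begin(), rows.end(),
      [](const std::vector<double>& a, const std::vector<double>& b) {
        return a[0] < b[0];
      });
  if (cutoff.has_value()) {
    // First row with time >= cutoff; valid because rows are now sorted
    const auto first_dropped = std::lower_bound(
        rows.begin(), rows.end(), cutoff.value(),
        [](const std::vector<double>& row, const double time) {
          return row[0] < time;
        });
    rows.erase(first_dropped, rows.end());
  }
  return rows;
}
```

Passing std::nullopt as the cutoff keeps every row, which corresponds to the last file in the sequence, where nothing later can overlap.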

knelli2 requested a review from nilsdeppe September 19, 2024 23:38

knelli2 mentioned this pull request Sep 19, 2024

ICERM CCE Feedback #6246

knelli2 force-pushed the h5_dat_combine branch 2 times, most recently from 7c0a988 to a7d1cef Compare September 20, 2024 00:09

nilsdeppe requested changes Sep 23, 2024

knelli2 force-pushed the h5_dat_combine branch from a7d1cef to 1578987 Compare September 27, 2024 23:35

knelli2 added 3 commits September 27, 2024 16:39

Rename combine_h5 to combine_h5_vol

0eedca2

Add function to combine H5 dat files

8f7bb08

This function also handles overlapping times by taking the "latest" data always.

Reorganize an H5 helper

aa42e7b

knelli2 force-pushed the h5_dat_combine branch from 1578987 to 5935c42 Compare September 27, 2024 23:44

knelli2 added 2 commits September 27, 2024 16:47

Add overload to dat and cce get_data functions

e9f4bcf

Now they can return either a Matrix or a vector<vector<double>>

fixup. Combine H5 dat

a67d874

knelli2 force-pushed the h5_dat_combine branch from 5935c42 to a67d874 Compare September 27, 2024 23:48
