SOAP I/O optimization #4

Open · jchelly opened this issue Jan 31, 2023 · 1 comment

Assignees: jchelly
Labels: enhancement (New feature or request)

Comments

jchelly (Collaborator) commented Jan 31, 2023

SOAP seems to put a big load on the Lustre file system on Cosma so we aren't able to run many instances at the same time. With recent improvements in HDF5 we might be able to fix this.

SOAP splits the simulation volume into chunks and puts the chunks into a queue. When a compute node is idle it takes the next chunk from the queue and computes properties of the halos in that chunk. This continues until all chunks have been processed. To read the particles for a chunk, the MPI ranks within the compute node each execute independent reads of the relevant parts of the input files. This is probably not very file-system friendly.
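
For reference, the current per-rank access pattern looks roughly like the sketch below (the file name, dataset path and slice bounds are placeholders; SOAP's actual bookkeeping is more involved):

```python
import h5py

def read_my_slice(filename, dataset, start, stop):
    # Each rank opens the file on its own and issues its own reads of the
    # compressed chunks overlapping [start, stop) -- many small,
    # uncoordinated requests hitting Lustre at the same time.
    with h5py.File(filename, "r") as f:
        return f[dataset][start:stop]

# Called independently on every rank with rank-specific bounds, e.g.
# coords = read_my_slice("snap_0077.0.hdf5", "PartType1/Coordinates", start, stop)
```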

We could instead have each node iterate through the files it needs to read and do a collective read on each one. This should use MPI-IO to funnel I/O through a few aggregator ranks, reducing the load on the file system while still having all ranks work on decompressing the data in HDF5.
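
A minimal sketch of that idea, assuming an MPI-enabled h5py build and a per-node communicator (names and bounds are again placeholders):

```python
import h5py
from mpi4py import MPI

def collective_read(comm, filename, dataset, start, stop):
    # All ranks on the node (comm) open the file together with the MPI-IO
    # driver, then do a collective read of their slices so that MPI-IO can
    # aggregate the requests on a few ranks.
    with h5py.File(filename, "r", driver="mpio", comm=comm) as f:
        dset = f[dataset]
        with dset.collective:
            return dset[start:stop]

# e.g. with a per-node communicator:
# node_comm = MPI.COMM_WORLD.Split_type(MPI.COMM_TYPE_SHARED)
# coords = collective_read(node_comm, "snap_0077.0.hdf5", "PartType1/Coordinates", start, stop)
```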

Until recently this didn't work: HDF5 before 1.14 falls back to independent reads if any filters are used, and metadata operations always use independent I/O. But 1.14 can do collective reads and writes of filtered datasets, and it also has an option to do metadata I/O with collective operations. If I use ROMIO as the MPI-IO layer and enable lazy file opening (romio_no_indep_rw=true) then I can use collective HDF5 calls to read datasets while having only one rank per node actually touch the file system.
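
A sketch of that configuration through h5py's low-level interface is below. The exact property-list calls (in particular set_all_coll_metadata_ops and wrapping the FileID in h5py.File) are my reading of the bindings, so treat this as a starting point rather than tested code:

```python
import h5py
from mpi4py import MPI

def open_for_collective_reads(filename, comm):
    # Pass the ROMIO hint so that only the aggregator ranks actually open
    # and touch the file ("lazy" independent open).
    info = MPI.Info.Create()
    info.Set("romio_no_indep_rw", "true")

    fapl = h5py.h5p.create(h5py.h5p.FILE_ACCESS)
    fapl.set_fapl_mpio(comm, info)
    # Ask HDF5 to do metadata reads collectively instead of every rank
    # reading the metadata independently.
    fapl.set_all_coll_metadata_ops(True)

    fid = h5py.h5f.open(filename.encode(), h5py.h5f.ACC_RDONLY, fapl=fapl)
    return h5py.File(fid)
```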

There's also a new function, H5Dread_multi, which lets you designate multiple datasets and the elements to read from each, and then reads everything in a single MPI-IO call. From some tests in C that seems to perform fairly well even with very few ranks accessing the file system. The downside is that H5Dread_multi isn't exposed in h5py and might be non-trivial to implement there.

jchelly added the enhancement (New feature or request) label on Jan 31, 2023
jchelly self-assigned this on Jan 31, 2023
jchelly (Collaborator, Author) commented Jun 5, 2024

Collective I/O with HDF5 still doesn't seem to scale very well, so here's another approach:

When all of the MPI ranks on a node need to read a chunked, compressed dataset into shared memory:

- Every rank creates an empty in-memory copy of the dataset using the core driver. This initially has no allocated chunks.
- A small number of ranks (maybe just one) are designated as readers. They use H5Dread_chunk to read the raw chunk data and distribute it to the other ranks.
- Each rank uses H5Dwrite_chunk to write its raw chunks into its in-memory dataset, reads the contents back (thereby invoking the HDF5 filter pipeline to decode the data), and copies the result to the shared output buffer.

This would let us have just a few ranks touching the disks while all ranks work on decompression. For uncompressed datasets the reader ranks could just copy the contents directly to the output buffer.
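
A rough sketch of the per-chunk decode step, using h5py's read_direct_chunk/write_direct_chunk wrappers and the core driver. The communication between reader and worker ranks, the shared output buffer, and how the dataset metadata reaches the non-reader ranks are only hinted at here:

```python
import h5py

def decode_chunk(src_dset, raw_chunk, chunk_offset):
    # Empty in-memory copy of the dataset (core driver, never written to
    # disk, no chunks allocated yet). The chunk shape and filter pipeline
    # must match the source dataset so the raw bytes decode correctly.
    with h5py.File("scratch", "w", driver="core", backing_store=False) as mem:
        mem_dset = mem.create_dataset(
            "data",
            shape=src_dset.shape,
            dtype=src_dset.dtype,
            chunks=src_dset.chunks,
            compression=src_dset.compression,
            compression_opts=src_dset.compression_opts,
            shuffle=src_dset.shuffle,
        )
        # Insert the still-compressed chunk, then read it back through the
        # HDF5 filter pipeline to decompress it on this rank.
        mem_dset.id.write_direct_chunk(chunk_offset, raw_chunk)
        sel = tuple(slice(o, min(o + c, s))
                    for o, c, s in zip(chunk_offset, src_dset.chunks, src_dset.shape))
        return mem_dset[sel]

# On a designated reader rank: fetch the raw, still-compressed bytes of a chunk,
#   filter_mask, raw = src_dset.id.read_direct_chunk(chunk_offset)
# then send (chunk_offset, raw) to whichever rank will decode it, and have that
# rank copy the result of decode_chunk(...) into the shared output buffer.
```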
