Parallel ID Issues

Greg Sjaardema edited this page Jan 25, 2022 · 4 revisions

[Saving an email response to a customer who is seeing confusing output from serial/parallel runs. I will try to expand this later into a better explanation.]

The map issue with the decomposed/recomposed files is as follows:

  • Assume input mesh has node and/or element map
  • On decomposition, the maps are hijacked in the spread files and used instead to relate the local implicit (1..N) nodes/elements in each spread file back to the implicit nodes/elements of the serial mesh
  • For example, if we decompose a 5-node serial mesh (10…20…30…40…50) to two processors:
    • P0 has first three nodes, so its map will be 1…2…3
    • P1 has last three nodes, so its map will be 3…4…5
    • Note that the original node map in the serial file is gone at this point. This problem has existed in nem_spread since it was first written.
  • To try to retain the original serial mesh node map, additional node/element maps named original_global_id_map were added to nem_spread several years ago
    • P0 will have the entries “10, 20, 30” in this map
    • P1 will have the entries “30, 40, 50” in this map.
  • If an application reads the original_global_id_map, then it can provide the user with the same node/element ids in parallel as in serial. If the application does not read the original_global_id_map, then the node/element ids will be the serial mesh implicit node ids.
  • In Sierra, in the “pre-original_global_id_map” times, there was an option to ignore the node/element maps in serial runs so that the node/element ids would match in serial and parallel runs.
  • I then added the capability to output the original_global_id_map to nem_spread, and the capability to read these maps to IOSS, and life was good again…
  • [Fixed in EPU-5.0] EPU does not read the original_global_id_map, so if you do a decomp followed by an immediate epu, the resulting mesh will only have the 1…N map and the original node/element id map will be lost.
    • An exodiff at this point will give an error since the original mesh had 10,20,30,40,50 and the epu’d mesh has 1,2,3,4,5
  • IOSS combines the global->serial implicit node map with the original_global_id_map and presents the client application with the doubly mapped ids. (10,20,30,40,50 in the example above)
    • On output, the files will have the 10,20,30 and 30,40,50 maps and epu will create an output file with the 10,20,30,40,50 map.
    • Exodiff at this point will work since the original and epu’d meshes have the same 10,20,30,40,50 map
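
The bookkeeping in the list above can be sketched in a few lines of Python. This is only an illustration of the two maps, not the nem_spread, IOSS, or EPU implementation: the function names, the `decomposition` dictionary, and the "sorted union" used as a stand-in for rejoining are all assumptions made for this example.

```python
# Sketch of the 5-node example; all names are illustrative, not real APIs.
serial_node_map = [10, 20, 30, 40, 50]  # ids stored in the serial file

# Assumed decomposition: serial implicit (1-based) node ids on each processor.
# Node 3 is shared by P0 and P1.
decomposition = {0: [1, 2, 3], 1: [3, 4, 5]}


def spread_maps(serial_map, implicit_ids):
    """Return (node_map, original_global_id_map) for one spread file."""
    node_map = list(implicit_ids)  # local node -> serial implicit id
    original = [serial_map[i - 1] for i in implicit_ids]  # local node -> original id
    return node_map, original


def ids_seen_by_application(serial_map, implicit_ids, reads_original_map):
    """Ids an application reports, depending on whether it reads the extra map."""
    node_map, original = spread_maps(serial_map, implicit_ids)
    return original if reads_original_map else node_map


def rejoined_ids(per_processor_ids):
    """Simplified stand-in for a rejoin: sorted union of the per-processor ids."""
    return sorted(set().union(*per_processor_ids))


for reads in (False, True):
    per_proc = [
        ids_seen_by_application(serial_node_map, decomposition[p], reads)
        for p in sorted(decomposition)
    ]
    print(reads, per_proc, rejoined_ids(per_proc))
```

With `reads_original_map=False` the rejoined ids are 1..5, which no longer match the serial file; with `True` they are 10,20,30,40,50, which do.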

If the original file has a node map that is a permutation of 1..#nodes, then there will probably be confusion following an epu since the original and epu'd files will look similar, but the ids will be scrambled.

  • For simplicity we will reduce this down to a 5-node mesh. Assume the map is 1,5,2,4,3
  • Decompose to 2 processors.
    • P0 has first three nodes, so its map will be 1…2…3
    • P1 has last three nodes, so its map will be 3…4…5
    • Note that the original node map in the serial file is gone at this point.
  • The original_global_id_map will have:
    • P0 will have the entries “1,5,2”
    • P1 will have the entries “2,4,3”
  • If the application does not read the original_global_id_map, on output the maps are:
    • P0 has 1,2,3
    • P1 has 3,4,5
  • EPU rejoins the files. It finds that all of the ids 1..5 exist in the output file, so it puts the nodes in order and doesn’t output the 1,2,3,4,5 map since it can be regenerated implicitly.
  • EXODIFF finds original file has nodes 1,5,2,4,3 and new file has 1,2,3,4,5.
    • It matches nodes by id (1..1, 2..2, 3..3, 4..4, 5..5) and finds that the matched nodes have different coordinates
    • Outputs a “metadata mismatch error/warning”. [I need to check why it says “metadata” in this error message]
    • If I do an “exodiff -match_file_order old.g new.g”, I should get a valid “no difference” output.
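
The difference between matching by id and matching by file order in the permutation example can be sketched as follows. This is not exodiff's algorithm: the coordinate values are made up for illustration, and the two helper functions are hypothetical stand-ins for the two matching strategies.

```python
# Sketch of the permutation example; coordinates and helpers are hypothetical.
original_ids = [1, 5, 2, 4, 3]  # node map in the original serial file
rejoined_ids = [1, 2, 3, 4, 5]  # implicit map in the epu'd file

# Assumed x-coordinates by file position. The nodes stay in the same order
# through the decomposition and rejoin, so both files share these values.
coords = [0.0, 1.0, 2.0, 3.0, 4.0]


def mismatches_by_id(ids_a, coords_a, ids_b, coords_b):
    """Ids whose coordinates differ when nodes are paired by id."""
    lookup_b = dict(zip(ids_b, coords_b))
    return [i for i, x in zip(ids_a, coords_a) if lookup_b[i] != x]


def mismatches_by_file_order(coords_a, coords_b):
    """1-based positions whose coordinates differ when nodes are paired by position."""
    return [
        pos
        for pos, (xa, xb) in enumerate(zip(coords_a, coords_b), start=1)
        if xa != xb
    ]


by_id = mismatches_by_id(original_ids, coords, rejoined_ids, coords)
by_order = mismatches_by_file_order(coords, coords)
print("matched by id:", by_id)
print("matched by file order:", by_order)
```

Pairing by id reports mismatches for ids 5, 2 and 3, because those ids sit at different positions in the two files; pairing by position reports none.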