
Refactor the archive exporter #4448

Merged: 23 commits into aiidateam:develop from export/refactor2, Oct 23, 2020

Conversation

@chrisjsewell (Member) commented Oct 16, 2020

This PR builds on top of #4446

The refactor fully decouples the extraction of data from the database from the writing of that data to an archive.

It provides an abstract writer class; each concrete writer is then identified by a string (which could, in fact, be made into an entry point).
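For illustration, here is a minimal sketch of that pattern, using the `ArchiveWriterAbstract` name that appears later in this thread; the string registry shown is an assumption for illustration, not the actual aiida-core implementation:

```python
import abc
from typing import Dict, Type

class ArchiveWriterAbstract(abc.ABC):
    """Interface that every concrete archive writer must implement."""

    @abc.abstractmethod
    def write(self, export_data: dict) -> dict:
        """Write the extracted data to an archive and return report metadata."""

# Concrete writers are looked up by a string identifier; this mapping
# could equally well be populated from entry points.
_WRITERS: Dict[str, Type[ArchiveWriterAbstract]] = {}

def get_writer(name: str) -> Type[ArchiveWriterAbstract]:
    return _WRITERS[name]

class ZipWriter(ArchiveWriterAbstract):
    def write(self, export_data: dict) -> dict:
        return {'format': 'zip'}  # placeholder for the actual zip writing

_WRITERS['zip'] = ZipWriter

writer = get_writer('zip')()  # select a concrete writer by its string key
```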

This code is backwards incompatible in terms of the Python API, in that it removes the `extract_zip` and `export_tar` functions, but it should still work with the verdi CLI, which only calls the `export` function, whose signature remains the same.

The commit also adds a backwards-compatible install requirement, `dataclasses`, which is included in the Python standard library from Python 3.7 onwards.
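For context, such a conditional dependency is typically declared with a PEP 508 environment marker, so the backport is only installed where the module is missing from the standard library (whether aiida-core uses exactly this mechanism is an assumption):

```python
# setup.py sketch: install the `dataclasses` backport only on Python < 3.7
from setuptools import setup

setup(
    name='example-package',
    install_requires=['dataclasses; python_version < "3.7"'],
)
```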

The tests have not yet been altered to accommodate these changes.

This PR is part of a set that aims to decouple the generation of data to export from the actual writing of the archive, as a prelude to introducing a new archive format.

The preceding PR (#4446) did not introduce any breaking changes; it:

- splits the amazingly complex `export_tree` function into a number of separate, self-contained functions
- adds mypy-compliant typing
- fixes a bug whereby, in Python 3.8, `logging.disable(level=...)` has changed to `logging.disable(lvl=...)`
@chrisjsewell (Member, Author)

@sphuber this is the second part of my master plan 😄

@chrisjsewell (Member, Author)

Note: the last thing I plan to do (maybe in a separate PR) is to decouple the somewhat hard-coded progress bar implementation into a more abstract (possibly callback-like) mechanism.
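As a rough sketch of what a callback-like mechanism could look like (all names here are hypothetical; this is not the eventual aiida-core design):

```python
from typing import Callable, Optional, Sequence

# A progress callback receives a description, the current count and the total.
ProgressCallback = Callable[[str, int, int], None]

def export_nodes(nodes: Sequence, callback: Optional[ProgressCallback] = None) -> None:
    """Export nodes, reporting progress only through the supplied callback."""
    total = len(nodes)
    for count, node in enumerate(nodes, start=1):
        ...  # write the node to the archive
        if callback is not None:
            callback('exporting nodes', count, total)

# Any front end can now supply its own reporting, e.g. a simple print:
export_nodes(['a', 'b'], callback=lambda desc, i, n: print(f'{desc}: {i}/{n}'))
```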

@chrisjsewell changed the title ♻️ Refactor archive exporter → ♻️ Refactor the archive exporter Oct 16, 2020
@ltalirz (Member) left a comment


Thanks a lot @chrisjsewell!

There are a couple of minor comments, mainly naming issues; nothing major.
I will now check out the PR and do a local test, to see whether performance remains the same.

On aiida/tools/importexport/dbexport/__init__.py:
```python
group_uuids: Dict[str, List[str]]
link_uuids: List[dict]
# all entity data from the database, except Node extras and attributes
entity_data: Dict[str, Dict[int, dict]]
```
@ltalirz (Member)

Any particular reason why this does not need to be iterable?

@chrisjsewell (Member, Author) Oct 22, 2020

Well, I've just initially copied what was being used previously. It probably could be, though.
`node_data` was the main one I wanted to make iterable, because that will be the one that takes up the most memory. I actually want to make it a generator, but I just need to make sure that works with the progress bar updates.
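A minimal sketch of the generator idea being described here (the function name and the tqdm-style progress API are assumptions for illustration):

```python
from typing import Dict, Iterator

def iter_node_data(records: Iterator[Dict], progress=None) -> Iterator[Dict]:
    """Yield node data one record at a time instead of holding it all in memory,
    updating a progress bar (if one is supplied) as each record is consumed."""
    for record in records:
        if progress is not None:
            progress.update(1)  # tqdm-style API
        yield record

# Usage: the writer can iterate without the full node_data list ever existing.
for node in iter_node_data(iter([{'id': 1}, {'id': 2}])):
    pass  # write `node` to the archive
```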

(Further review threads on aiida/tools/importexport/dbexport/__init__.py, all resolved.)
@ltalirz marked this pull request as ready for review October 22, 2020 13:51
@ltalirz (Member) commented Oct 22, 2020

I have now done some export tests.

Performance seems unaffected (I had no reason to think it was; just checking to be safe). There is just one minor detail: I think the final progress bar is no longer being removed.

old:

```console
Exporting a total of 13728 database entries, of which 9421 are Nodes.
Success: wrote the export archive file to develop.aiida
```

new:

```console
Exporting a total of 13728 database entries, of which 9421 are Nodes.
Exporting repository - UUID=e65d7e5d      99.9%|███████████████████████████████████████████████████████████▉| 9412/9421
Success: wrote the export archive file to chris.aiida
```

@sphuber @giovannipizzi FYI: I just want to reiterate that (on the server I'm testing on, with a magnetic disk) the time spent exporting the file repository strongly dominates the overall export time.

Exporting the same set of nodes twice in a row, I see huge differences in the repository export timings just from the disk cache, with the total export time going down by a factor of 6: from 192 s on the first run to ~31 s on the second.
Very much looking forward to the improvements brought by the new repository implementation :-)

@chrisjsewell (Member, Author)

> there is just one minor detail: I think the final progress bar is no longer being removed.

yep already got a fix for that 👍

@codecov (bot) commented Oct 22, 2020

Codecov Report

Merging #4448 into develop will increase coverage by 0.06%.
The diff coverage is 97.96%.


```diff
@@             Coverage Diff             @@
##           develop    #4448      +/-   ##
===========================================
+ Coverage    79.31%   79.37%   +0.06%     
===========================================
  Files          475      475              
  Lines        34840    34930      +90     
===========================================
+ Hits         27631    27722      +91     
+ Misses        7209     7208       -1     
```
| Flag | Coverage Δ |
| --- | --- |
| #django | 73.23% <97.96%> (+0.07%) ⬆️ |
| #sqlalchemy | 72.46% <97.96%> (+0.08%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
| --- | --- |
| aiida/tools/importexport/dbexport/__init__.py | 98.06% <97.94%> (+0.34%) ⬆️ |
| aiida/tools/importexport/dbexport/utils.py | 82.27% <100.00%> (-0.83%) ⬇️ |
| aiida/transports/plugins/local.py | 82.57% <0.00%> (-0.25%) ⬇️ |
| aiida/engine/daemon/client.py | 73.57% <0.00%> (+1.15%) ⬆️ |
| aiida/engine/daemon/runner.py | 82.76% <0.00%> (+3.45%) ⬆️ |

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 228716f...7206fdb. Read the comment docs.

@ltalirz previously approved these changes Oct 22, 2020

@ltalirz (Member) left a comment


thanks @chrisjsewell !

From my side this is fine, just two minor comments. You can fix them or not; you'll need to update the PR anyhow before merging.

On aiida/tools/importexport/dbexport/__init__.py:

```
@@ -475,9 +492,9 @@ def export(
    **traversal_rules
)

output_data = {}
report_data: Dict[str, Any] = {'time_write_start': None, 'time_write_stop': None, 'writer_data': None}
```
@ltalirz (Member) Oct 22, 2020

Just out of curiosity: is this the suggested way to build instances of dataclasses, i.e. to build a dict and then pass it to the constructor?

E.g. couldn't you just initialize it below with
`report = ExportReport(time_collect_start=..., time_collect_stop=...)`
and then just add to the instance?

It's not very important here, but in general it would avoid retyping the attributes of the data class as dict keys (possibly introducing typos).

@chrisjsewell (Member, Author)

Well, you would have to initialise it with a dummy `time_collect_start` etc., and then assume that it will get overridden later on, which I don't think is a good idea.

@ltalirz (Member)

No, you could just initialize it once you have both collect times.

@chrisjsewell (Member, Author)

Ah, potayto, potahto; both are valid 😉
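To illustrate the two construction patterns being debated (the `ExportReport` fields here are simplified assumptions based on the diff above):

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass
class ExportReport:
    time_collect_start: float
    time_collect_stop: float
    time_write_start: Optional[float] = None
    time_write_stop: Optional[float] = None
    writer_data: Optional[Dict[str, Any]] = None

# Pattern in the PR: accumulate values in a dict, construct at the very end.
# A typo in a key is only caught at runtime, when the constructor rejects it.
report_data: Dict[str, Any] = {'time_write_start': None, 'time_write_stop': None, 'writer_data': None}
report_data['time_collect_start'] = 0.0
report_data['time_collect_stop'] = 1.5
report = ExportReport(**report_data)

# Suggested pattern: construct once both collect times are known, then assign
# the remaining attributes. Typos here are caught by a static type checker.
report = ExportReport(time_collect_start=0.0, time_collect_stop=1.5)
report.time_write_start = 2.0
```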

Co-authored-by: Leopold Talirz <leopold.talirz@gmail.com>
@chrisjsewell (Member, Author) commented Oct 22, 2020

OK, so it turns out the summary of traversal rules output by `verdi export create` was completely wrong, lol (can't believe no one had noticed), so in 9151ab3 I have fixed that by ensuring the rule set and defaults are taken directly from the `GraphTraversalRules` enum.
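A sketch of the single-source-of-truth idea (the rule names and the shape of the real `GraphTraversalRules` enum in aiida-core differ; this only illustrates deriving the summary and its defaults from the enum rather than hand-writing them):

```python
from enum import Enum

class TraversalRule(Enum):
    # (rule name, default value); a stand-in for GraphTraversalRules
    INPUT_CALC_FORWARD = ('input_calc_forward', True)
    CREATE_BACKWARD = ('create_backward', True)
    RETURN_BACKWARD = ('return_backward', False)

    def __init__(self, rule_name: str, default: bool):
        self.rule_name = rule_name
        self.default = default

def traversal_summary(overrides: dict) -> list:
    """Build the CLI 'Traversal rules' table directly from the enum."""
    return [
        (rule.rule_name, overrides.get(rule.rule_name, rule.default))
        for rule in TraversalRule
    ]

print(traversal_summary({'input_calc_forward': False}))
```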

Then, to test (for those interested, but also a note to self):

1. I fired up a docker-compose file (with a `.dockerignore` in the repo to ignore the `.tox` folder):

```yaml
version: '3.4'
services:
  aiida:
    build:
      context: ../
      dockerfile: Dockerfile
    image: "aiidateam/aiida-core:latest"
    container_name: aiida-core
    networks:
      - aiida-core
    volumes:
      - "../:/tmp/mount_folder"

networks:
  aiida-core:
```

```console
$ docker-compose -f .ci/docker-aiida-core.yml up -d --build
$ docker exec -it --user aiida aiida-core /bin/bash
```

2. I used my recursive tree builder (in `tests/benchmark/test_importexport.py`) to generate a 40,000-node tree (I CTRL-C'd early):

```python
root = orm.Dict()
recursive_provenance(root, depth=9, breadth=3, num_objects=1)
```

3. I ran the CLI:

```console
$ verdi export create --input-calc-forward -N 1 -- tmp/mount_folder/out.aiida

EXPORT
--------------  --------------------------
Archive         tmp/mount_folder/out.aiida
Format          Zip (compressed)
Export version  0.9

Inclusion rules
-----------------  ----
Include Comments   True
Include Logs       True

Traversal rules
---------------------------------  -----
Follow links input calc forwards   True
Follow links input calc backwards  True
Follow links create forwards       True
Follow links create backwards      True
Follow links return forwards       True
Follow links return backwards      False
Follow links input work forwards   False
Follow links input work backwards  True
Follow links call calc forwards    True
Follow links call calc backwards   True
Follow links call work forwards    True
Follow links call work backwards   True

Exporting a total of 40066 database entries, of which 40065 are Nodes.
Success: wrote the export archive file to tmp/mount_folder/out.aiida
```

@ltalirz (Member) commented Oct 22, 2020

hehe, I think this warrants opening an issue to be closed by this PR ;-)

@ltalirz self-requested a review October 22, 2020 23:21
@chrisjsewell (Member, Author)

> hehe, I think this warrants opening an issue to be closed by this PR ;-)

lol, do we get issue-closing-related pay?

@ltalirz (Member) commented Oct 22, 2020

just as a record that this issue existed ;-)

@ltalirz previously approved these changes Oct 22, 2020

@ltalirz (Member) left a comment


I'm fine with this if the tests pass.

Please refrain from emojis in the final commit message, for the time being ;-)

@chrisjsewell (Member, Author)

👎 😢 😭

@chrisjsewell (Member, Author)

Honestly, they do make the commit log so much clearer, though (but I'll refrain).

@chrisjsewell (Member, Author)

Ah, got some autodoc nitpicks to fix:

```text
/home/circleci/.local/lib/python3.7/site-packages/aiida/tools/importexport/dbexport/__init__.py:docstring of aiida.tools.importexport.ArchiveWriterAbstract.write:: WARNING: py:class reference target not found: aiida.tools.importexport.dbexport.ArchiveData
/home/circleci/.local/lib/python3.7/site-packages/aiida/tools/importexport/__init__.py:docstring of aiida.tools.importexport.export:: WARNING: py:class reference target not found: aiida.tools.importexport.dbexport.ExportReport
/home/circleci/.local/lib/python3.7/site-packages/aiida/tools/importexport/dbexport/__init__.py:docstring of aiida.tools.importexport.dbexport.ArchiveWriterAbstract.write:: WARNING: py:class reference target not found: aiida.tools.importexport.dbexport.ArchiveData
/home/circleci/.local/lib/python3.7/site-packages/aiida/tools/importexport/dbexport/__init__.py:docstring of aiida.tools.importexport.dbexport.export:: WARNING: py:class reference target not found: aiida.tools.importexport.dbexport.ExportReport
```

I believe this is due to the order of processing (i.e. the object referencing it is processed before the object being referenced). It must be a Sphinx v3 regression; I'll open an issue on Sphinx about it.

@chrisjsewell changed the title ♻️ Refactor the archive exporter → Refactor the archive exporter Oct 23, 2020
@chrisjsewell merged commit c6d38c1 into aiidateam:develop Oct 23, 2020
@chrisjsewell deleted the export/refactor2 branch October 23, 2020 00:12
@ltalirz self-requested a review October 23, 2020 07:48
chrisjsewell added a commit that referenced this pull request Nov 12, 2020
This PR builds on #4448, with the goal of improving both the export writer API (allowing for "streamed" data writing) and the performance of the export process (CPU and memory usage).

The writer is now used as a context manager, rather than being passed all the data after its extraction from the AiiDA database. This means it is called throughout the export process, and will allow for less data to be kept in RAM when moving to a new archive format.

The number of database queries has also been reduced, resulting in a faster process.

Lastly, code for reads/writes to the archive has been moved to the https://github.com/aiidateam/archive-path package. This standardises the interface for both zip and tar and, especially for export to tar, provides much improved performance, since the data is now written directly to the archive (rather than being written to a folder and only compressed at the end).

Co-authored-by: Leopold Talirz <leopold.talirz@gmail.com>
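A minimal sketch of the context-manager pattern the commit message above describes (class and function names here are hypothetical, not the actual aiida-core API):

```python
from typing import Iterator, Tuple

class ArchiveWriter:
    """Hypothetical writer, illustrating the context-manager pattern only."""

    def __enter__(self) -> 'ArchiveWriter':
        print('open archive')  # e.g. open a zip/tar via archive-path
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        print('finalize and close archive')

    def write_entity(self, name: str, data: dict) -> None:
        print(f'stream {name} into the archive')

def extract_from_database() -> Iterator[Tuple[str, dict]]:
    """Stand-in for the database extraction step."""
    yield 'node-1', {'uuid': '...'}
    yield 'node-2', {'uuid': '...'}

# Records are streamed to the writer as they are extracted,
# rather than collected fully in memory and written at the end.
with ArchiveWriter() as writer:
    for name, data in extract_from_database():
        writer.write_entity(name, data)
```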
@csadorf (Contributor) commented Nov 16, 2020

I would appreciate it if I were kept in the loop when we add dependencies to aiida-core.

@ltalirz (Member) commented Nov 16, 2020

Thanks @csadorf; @chrisjsewell and I will remember it.
