
timeout on big files #197

Closed
wurDevTim opened this issue Jun 20, 2024 · 6 comments · Fixed by #204

@wurDevTim (Collaborator)

Setup
Server: Ubuntu 20, running iRODS 4.3.1
Client: Windows 10 VM, PRC 2.0.1, ibridges 0.2.2

Since the issue is time related, it may help to know that the connection to the iRODS server is 1 GB/sec, which is fully utilized during the transfer.

Background information
The default connection timeout in the PRC is 120 seconds: https://github.com/irods/python-irodsclient/blob/c3963c21914ac859a9ae65d5834681c566d5ada3/irods/__init__.py#L43C33-L44C1

It is known that this leads to timeouts during the checksum calculation of larger files:
irods/python-irodsclient#564

As a solution, the following lines have been added in iBridges:

original_timeout = session.irods_session.pool.connection_timeout
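
Presumably the surrounding code follows a save-raise-restore pattern; a minimal sketch of what that could look like (hypothetical, the actual iBridges code differs; 12885 is the scaled timeout value observed below):

original_timeout = session.irods_session.pool.connection_timeout
session.irods_session.pool.connection_timeout = 12885  # scaled to the file size
try:
    upload(session, local_path, irods_path)  # long-running transfer
finally:
    session.irods_session.pool.connection_timeout = original_timeout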

Issue
When we perform data operations, such as checking whether an object already exists, two sockets are created with the default timeout. If we then upload a large file, 50 GB still goes fine; at 100 GB we consistently see this issue (we did not look for the exact file size at which the issue starts).
A breakpoint at irods/connection.py line 245 informs us that two new sockets are created when the upload starts, both with a timeout of 12885 seconds.
Next, the file uploads as expected. When the transfer is complete and the server starts the checksum calculation, we get a network exception after exactly 2 minutes; the socket that throws the error has a timeout of 120 seconds, as shown in the image below.
[image: debugger screenshot of the socket that raised the exception, timeout 120 s]
And if we wait a few minutes, we see the checksum appear in iRODS:
icommands: ils -l
/xx/home/xx:
xx 0 hot_1; 8589934588 2024-06-20.08:53 & 100GB_file.txt

Example code

import json
import sys
from pathlib import Path
from getpass import getpass

from ibridges import Session
from ibridges.data_operations import upload
from ibridges.path import IrodsPath


if __name__ == "__main__":
    # Environment file
    env_file = Path.home().joinpath(".irods", "irods_environment.json")
    if not env_file.exists():
        print("Environment file not found")
        sys.exit(1)

    # Authenticate
    with open(env_file, "r") as f:
        ienv = json.load(f)

    password = getpass("Your iRODS password: ")
    session = Session(irods_env=ienv, password=password)

    # Create a test file of zero bytes if it does not exist yet
    filename = "100GB.txt"
    if not Path(filename).exists():
        size_in_gb = 100
        with open(filename, "wb") as f:
            for _ in range(size_in_gb):
                f.write(b"\0" * 1024**3)  # write 1 GiB per iteration
                f.flush()

    # Upload with checksum registration and verification
    target_ipath = IrodsPath(session)
    irods_path = target_ipath.joinpath(filename)
    options = {'regChksum': True, 'verifyChksum': True, 'ChksumAll': True}
    if not irods_path.dataobject_exists():
        print(f"Uploading {filename} to {irods_path}")
        upload(session, filename, irods_path, overwrite=True, options=options)

If we modify iBridges by adding:

irods_session.connection_timeout = 250000
irods_session.pool.connection_timeout = 250000

before both occurrences of:

_ = irods_session.server_version

we see that all sockets get the higher timeout and we don't experience any network exceptions.

@luijs commented Jun 20, 2024

Note that it doesn't work if you add the timeout after the line _ = irods_session.server_version; it really needs to be set before it.
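
A minimal sketch of that ordering (irods_session is the PRC session that iBridges wraps):

# Works: server_version opens the first socket, which inherits the raised timeout
irods_session.connection_timeout = 250000
irods_session.pool.connection_timeout = 250000
_ = irods_session.server_version

# Does not work: server_version has already opened a socket with the default
# 120 s timeout, and that pooled socket can be reused later
_ = irods_session.server_version
irods_session.connection_timeout = 250000
irods_session.pool.connection_timeout = 250000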

@d-w-moore

> Note that it doesn't work if you add the timeout after the line _ = irods_session.server_version; it really needs to be set before it.

True. That said, if you invoke cleanup() on the session object, the increased timeout should always take effect; this empties the idle and active pools of existing connections.

session.connection_timeout = LONG_TIME_IN_SECONDS  # increase the interval for socket timeout
session.cleanup()
# all operations should now wait the increased interval
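
In ibridges terms, where the PRC session is wrapped, the same pattern would presumably read (a sketch; session here is an ibridges Session):

session.irods_session.connection_timeout = LONG_TIME_IN_SECONDS
session.irods_session.cleanup()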

@qubixes (Collaborator) commented Jun 20, 2024

@wurDevTim Thanks for the bug report. From the discussion and from reading through the PRC code, if I understand correctly, the following happens:

  • The session is initialized with connection_timeout 120.
  • iBridges detects a large file and tries to dynamically raise this connection_timeout (depending on the file size) by setting session.pool.connection_timeout to a higher value.
  • The upload creates a new connection with the longer timeout.
  • The checksum part of the upload uses a recycled connection that still has the initial 120 s value. (I'm not sure which connection is recycled, but I suppose that doesn't matter for now.)
  • The checksum takes longer than 120 s -> times out.

As for the solution, I'm not sure what the most durable one would be. As far as I understand, there is no mechanism to ensure that connections in the pool get the new connection_timeout, since old/idle connections can be reused. My first idea would be to set up a whole new connection pool when iBridges detects a file that might be too large; this pool would then have the new connection timeout. Another option would be to reset the session (with cleanup(), as above), but that might be relatively expensive, and it might also not work so well once we implement asynchronous operations in iBridges. Any (other) suggestions are welcome; a sketch of one possibility follows below.
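
As an illustration of the size-dependent idea, a minimal sketch (the helper name, the assumed 50 MiB/s checksum rate, and the factor of two are hypothetical, not iBridges code):

import os

def scaled_timeout(local_file, base=120, checksum_rate=50 * 1024**2):
    # Hypothetical helper: scale the socket timeout with file size,
    # assuming a conservative server-side checksum rate in bytes/sec
    size = os.path.getsize(local_file)
    return max(base, 2 * size // checksum_rate)

session.connection_timeout = scaled_timeout("100GB.txt")
session.cleanup()  # drop pooled sockets so new ones inherit the timeout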

For now, a workaround is to increase the connection_timeout in the .irods_environment.json file. @wurDevTim, did you check whether this fixes the problem? I will also add a question to this effect to the FAQ for users.
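
For example, an abridged .irods_environment.json (host, port and zone are placeholders):

{
    "irods_host": "irods.example.org",
    "irods_port": 1247,
    "irods_zone_name": "myZone",
    "connection_timeout": 250000
}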

@qubixes qubixes self-assigned this Jun 20, 2024
@qubixes qubixes added this to the v1.x milestone Jun 20, 2024
@wurDevTim (Collaborator, Author)

I might have another suggestion; have a look here: https://github.com/irods/python-irodsclient/blob/c3963c21914ac859a9ae65d5834681c566d5ada3/irods/pool.py#L65
Maybe I misunderstand the functionality, but I read it as: adding "refresh_time": 300 (any time shorter than the transfer time of the file) to the json should solve our issue. Assuming the two newly created connections with the longer 12885 s timeout are used during the transfer, the connections with the default 120 s timeout are idle during the transfer, which takes more than 5 minutes; therefore they should be dropped instead of reused for the checksum. In my tests this is not working, which makes me suspect that this functionality is either broken or that I misunderstand it. Either way, this might be another way to solve the issue.

Adding cleanup() to the line in iBridges/ibridges/data_operations.py did not work in my tests, and I also foresee issues with asynchronous transfers; therefore I have not run any additional tests.
Adding "connection_timeout": 250000 to the json does solve the issue; while not ideal, it can be used as a workaround. It has the same effect as setting the connection_timeout of the session object directly after its creation.

@qubixes (Collaborator) commented Jun 21, 2024

@wurDevTim What exactly did you try in your tests? From the code I would indeed expect that pool.refresh_connection == True and pool.connection_refresh_time == 0 would solve the issue (i.e. no re-use). Perhaps setting the logging to debug level will shed some light on the issue.
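
For example, since the PRC uses the standard logging module, a root-level configuration should surface its connection messages:

import logging

# Show DEBUG output from all loggers, including the irods.* ones
logging.basicConfig(level=logging.DEBUG)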

@wurDevTim (Collaborator, Author) commented Jun 21, 2024

@qubixes my main challenge with debugging further is that I am not sure whether what we expect is indeed the intended behavior of 'refresh_time'. If it is, then it looks like we have found another bug. @d-w-moore, can you shed some light on this?

A description of my tests:

Test 1
Add "refresh_time": 300 to the environment json and run the script from my initial post.
During the checksum calculation there is a NetworkException.
The exception stops the debugger, which allows me to check the socket: the timeout was set to 120.
A few minutes later the checksum shows up on the iRODS server, so the transfer was successful.

Test 2 (newly added)
If you look at https://github.com/irods/python-irodsclient/blob/c3963c21914ac859a9ae65d5834681c566d5ada3/irods/pool.py#L45 you will see that setting "refresh_time" to 0 won't work, so I set it to 1 instead.
I also verified that pool.refresh_connection == True and pool.connection_refresh_time == 1 when the session is created. Unfortunately this gives the same result as test 1.
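
A sketch of that check, right after the session is created:

session = Session(irods_env=ienv, password=password)
pool = session.irods_session.pool
print(pool.refresh_connection, pool.connection_refresh_time)  # expected: True 1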

Test 3
Remove the "refresh_time": 300 and change iBridges/ibridges/data_operations.py to call session.irods_session.cleanup(). I believe this is the solution @d-w-moore suggested?
I ran the script from my initial post; during the checksum calculation there is a NetworkException.
The exception stops the debugger, which allows me to check the socket: the timeout was set to 120.
A few minutes later the checksum shows up on the iRODS server, so the transfer was successful.

Test 4
Reinstall iBridges to ensure we have a clean installation again.
Add "connection_timeout": 250000 to my environment json and run the script from my initial post.
No NetworkException; the python script exits as intended. Both the file and the checksum are on the server.
