-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Retry after failure to download or failure to open files #4609
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is still in draft but I wanted to address your questions:
The part that's hard to test is the download. I'd separate that out so that you can test unpacking the tarball with an integration test that commits a tarball as part of the test. You can do one well-formed tarball and one mal-formed one to test both sides.
As for a debug log message, I might consider making this a longer one. Listing that this might be an issue with github and to check their status page, if it's a new release to consider logging a ticket with the package maintainers, and maybe even include steps or a link to steps that show how to manually install the dependency and to log a bug with the core team if a manual install worked.
deps_path, | ||
package_name | ||
) | ||
connection_exception_retry(download_untar_fn, 5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might help with testing and minimizing retry logic to separate downloading and unpacking the tarball into two distinct steps. If the error is with unpacking, redoing the download would be unnecessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nathaniel-may we do possibly need to retry the download when unpacking the tarball fails. We have data from the last github incident where only about half of the runs failed during the incident implying that some jobs were able to run dbt deps
successfully. It's impossible to know why but seems worth it to retry downloading the file in case something went wrong there.
That said, pulling them apart will make it easier to test so I'll incorporate both for the purpose of testing.
040f7e2
to
bfe7b03
Compare
@@ -485,7 +485,7 @@ def untar_package( | |||
) -> None: | |||
tar_path = convert_path(tar_path) | |||
tar_dir_name = None | |||
with tarfile.open(tar_path, 'r') as tarball: | |||
with tarfile.open(tar_path, 'r:gz') as tarball: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The exception we were raising was in a section where the code was figuring out the compression type. Since we know it's gz
, set it and hopefully bubble up a better exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great! I left some minor suggestions for improving the test readability, and before I approve I want to understand that one last line. But nothing I've said in this review should block a merge.
pass | ||
|
||
def test_untar_package_success(self): | ||
with NamedTemporaryFile( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional suggestion for readability, if the tarball creations for these tests were in separate methods, the tests themselves could read more cleanly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They're all actually different so I'll leave them in the test but comment what's happening for clarity.
tar.addfile(tarfile.TarInfo(relative_file_a), open(file_a.name)) | ||
|
||
assert tarfile.is_tarfile(tar.name) | ||
dbt.clients.system.untar_package(tar_file_full_path, self.tempdest) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would at least comment here that THIS is the thing we're testing so that future failures are easier to debug.
with self.assertRaises(tarfile.ReadError) as exc: | ||
dbt.clients.system.untar_package(tar_file_path, self.tempdest) | ||
|
||
def test_untar_package_empty(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's comment all the tests with the situation they're testing, and the expected results that we're asserting.
@@ -185,3 +188,55 @@ def tearDown(self): | |||
shutil.rmtree(self.base_dir) | |||
except: | |||
pass | |||
|
|||
|
|||
class TestUntarPackage(unittest.TestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love the strategies you've employed here. Making the tarballs in the tests is much cleaner than my previous suggestion of committing tarball files.
def test_connection_exception_retry_success_none_response(self): | ||
Counter._reset() | ||
connection_exception_retry(lambda: Counter._add_with_none_exception(), 5) | ||
self.assertEqual(2, counter) # 2 = original attempt returned None, plus 1 retry | ||
|
||
def test_connection_exception_retry_success_failed_untar(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love a good unit test.
core/dbt/utils.py
Outdated
@@ -600,14 +601,15 @@ def __contains__(self, name) -> bool: | |||
|
|||
def _connection_exception_retry(fn, max_attempts: int, attempt: int = 0): | |||
"""Attempts to run a function that makes an external call, if the call fails | |||
on a connection error or timeout, it will be tried up to 5 more times. | |||
on a connection error, timeout or other exception, it will be tried up to 5 more times. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are "other exceptions" all related to decoding/reading? should this comment be more specific? It might be able to provide historical context by including a ticket number.
@@ -774,6 +774,10 @@ def system_error(operation_name): | |||
|
|||
|
|||
class ConnectionException(Exception): | |||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
""" | ||
Sometimes the download of the files fails and we want to retry. Sometimes the | ||
download appears successful but the file did not make it through as expected | ||
(generally due to a github incident). Either way we want to retry downloading | ||
and untarring to see if we can get a success. Call this within | ||
`_connection_exception_retry` | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
@@ -33,7 +33,7 @@ def _get(path, registry_base_url=None): | |||
resp = requests.get(url, timeout=30) | |||
fire_event(RegistryProgressGETResponse(url=url, resp_code=resp.status_code)) | |||
resp.raise_for_status() | |||
if resp is None: | |||
if resp.json() is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the only line I don't totally understand. Can you describe what's going on here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can get a response that is not None
but response.json()
is None
. This was originally reported as needed to be retried (why raising ContentDecodingError
was added) by #4601. #4577 was reported because the problem was never actually resolved because we were never falling into this if
statement. This change will force the retires as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gotcha. So what does this json()
method do? what does it signify when it returns None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resp
is a response object (attributes of content, headers, status_code, etc) and the json()
method returns the json-encoded content of the response (or raises an error).
In some cases it is a completely legitimate response. You made a valid request but there's no data for what you requested. We do not expect an empty response. The content of the response should either not be None or we should get an error (and catch said error with resp.raise_for_status()
on line 35). From what's been gathered it looks like this is happening when something is wrong with the hub. It's honestly difficult to replicate. Catching it and logging the full response will hopefully be helpful too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh this makes sense. I might add a comment here about having no json encoded data and that being unexpected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Added.
bfe7b03
to
e9ff1cb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks very clean, and nicely commented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work!
7f0f48b
to
b80a6cc
Compare
The backport to
To backport manually, run these commands in your terminal: # Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-1.0.latest 1.0.latest
# Navigate to the new working tree
cd .worktrees/backport-1.0.latest
# Create a new branch
git switch --create backport-4609-to-1.0.latest
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick --mainline 1 d9cfeb1ea33cf55c5b96ef1f729eab637e548cb9
# Push it to GitHub
git push --set-upstream origin backport-4609-to-1.0.latest
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-1.0.latest Then, create a pull request where the |
* add retry logic, tests when extracting tarfile fails * fixed bug with not catching empty responses * specify compression type * WIP test * more testing work * fixed up unit test * add changelog * Add more comments! * clarify why we do the json() check for None # Conflicts: # CHANGELOG.md
* add retry logic, tests when extracting tarfile fails * fixed bug with not catching empty responses * specify compression type * WIP test * more testing work * fixed up unit test * add changelog * Add more comments! * clarify why we do the json() check for None # Conflicts: # CHANGELOG.md
* add retry logic, tests when extracting tarfile fails * fixed bug with not catching empty responses * specify compression type * WIP test * more testing work * fixed up unit test * add changelog * Add more comments! * clarify why we do the json() check for None automatic commit by git-black, original commits: d9cfeb1
resolves #4579, #4577
Description
NoneType
.Checklist
CHANGELOG.md
and added information about my change