Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry after failure to download or failure to open files #4609

Merged
merged 9 commits into from
Jan 31, 2022

Conversation

emmyoop
Copy link
Member

@emmyoop emmyoop commented Jan 21, 2022

resolves #4579, #4577

Description

  • Add retry logic when there's a problem extracting the tarball.
  • Fix retry logic when a response is NoneType.
  • Add relevant unit tests.

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change

@cla-bot cla-bot bot added the cla:yes label Jan 21, 2022
@emmyoop emmyoop changed the title Er/ct 51 handle bad tar Retry after failure to download or failure to open files Jan 21, 2022
Copy link
Contributor

@nathaniel-may nathaniel-may left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is still in draft but I wanted to address your questions:

The part that's hard to test is the download. I'd separate that out so that you can test unpacking the tarball with an integration test that commits a tarball as part of the test. You can do one well-formed tarball and one mal-formed one to test both sides.

As for a debug log message, I might consider making this a longer one. Listing that this might be an issue with github and to check their status page, if it's a new release to consider logging a ticket with the package maintainers, and maybe even include steps or a link to steps that show how to manually install the dependency and to log a bug with the core team if a manual install worked.

deps_path,
package_name
)
connection_exception_retry(download_untar_fn, 5)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might help with testing and minimizing retry logic to separate downloading and unpacking the tarball into two distinct steps. If the error is with unpacking, redoing the download would be unnecessary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nathaniel-may we do possibly need to retry the download when unpacking the tarball fails. We have data from the last github incident where only about half of the runs failed during the incident implying that some jobs were able to run dbt deps successfully. It's impossible to know why but seems worth it to retry downloading the file in case something went wrong there.

That said, pulling them apart will make it easier to test so I'll incorporate both for the purpose of testing.

@emmyoop emmyoop added the backport 1.0.latest Tag for PR to be backported to the 1.0.latest branch label Jan 26, 2022
@emmyoop emmyoop linked an issue Jan 27, 2022 that may be closed by this pull request
1 task
@emmyoop emmyoop marked this pull request as ready for review January 27, 2022 21:34
@emmyoop emmyoop requested review from a team as code owners January 27, 2022 21:34
@@ -485,7 +485,7 @@ def untar_package(
) -> None:
tar_path = convert_path(tar_path)
tar_dir_name = None
with tarfile.open(tar_path, 'r') as tarball:
with tarfile.open(tar_path, 'r:gz') as tarball:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exception we were raising was in a section where the code was figuring out the compression type. Since we know it's gz, set it and hopefully bubble up a better exception.

@emmyoop emmyoop requested review from nathaniel-may and removed request for a team January 28, 2022 16:13
Copy link
Contributor

@nathaniel-may nathaniel-may left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! I left some minor suggestions for improving the test readability, and before I approve I want to understand that one last line. But nothing I've said in this review should block a merge.

pass

def test_untar_package_success(self):
with NamedTemporaryFile(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional suggestion for readability, if the tarball creations for these tests were in separate methods, the tests themselves could read more cleanly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're all actually different so I'll leave them in the test but comment what's happening for clarity.

tar.addfile(tarfile.TarInfo(relative_file_a), open(file_a.name))

assert tarfile.is_tarfile(tar.name)
dbt.clients.system.untar_package(tar_file_full_path, self.tempdest)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would at least comment here that THIS is the thing we're testing so that future failures are easier to debug.

with self.assertRaises(tarfile.ReadError) as exc:
dbt.clients.system.untar_package(tar_file_path, self.tempdest)

def test_untar_package_empty(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's comment all the tests with the situation they're testing, and the expected results that we're asserting.

@@ -185,3 +188,55 @@ def tearDown(self):
shutil.rmtree(self.base_dir)
except:
pass


class TestUntarPackage(unittest.TestCase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love the strategies you've employed here. Making the tarballs in the tests is much cleaner than my previous suggestion of committing tarball files.

def test_connection_exception_retry_success_none_response(self):
Counter._reset()
connection_exception_retry(lambda: Counter._add_with_none_exception(), 5)
self.assertEqual(2, counter) # 2 = original attempt returned None, plus 1 retry

def test_connection_exception_retry_success_failed_untar(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love a good unit test.

@@ -600,14 +601,15 @@ def __contains__(self, name) -> bool:

def _connection_exception_retry(fn, max_attempts: int, attempt: int = 0):
"""Attempts to run a function that makes an external call, if the call fails
on a connection error or timeout, it will be tried up to 5 more times.
on a connection error, timeout or other exception, it will be tried up to 5 more times.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are "other exceptions" all related to decoding/reading? should this comment be more specific? It might be able to provide historical context by including a ticket number.

@@ -774,6 +774,10 @@ def system_error(operation_name):


class ConnectionException(Exception):
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

Comment on lines +86 to +92
"""
Sometimes the download of the files fails and we want to retry. Sometimes the
download appears successful but the file did not make it through as expected
(generally due to a github incident). Either way we want to retry downloading
and untarring to see if we can get a success. Call this within
`_connection_exception_retry`
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@@ -33,7 +33,7 @@ def _get(path, registry_base_url=None):
resp = requests.get(url, timeout=30)
fire_event(RegistryProgressGETResponse(url=url, resp_code=resp.status_code))
resp.raise_for_status()
if resp is None:
if resp.json() is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the only line I don't totally understand. Can you describe what's going on here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can get a response that is not None but response.json() is None. This was originally reported as needed to be retried (why raising ContentDecodingError was added) by #4601. #4577 was reported because the problem was never actually resolved because we were never falling into this if statement. This change will force the retires as expected.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gotcha. So what does this json() method do? what does it signify when it returns None

Copy link
Member Author

@emmyoop emmyoop Jan 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resp is a response object (attributes of content, headers, status_code, etc) and the json() method returns the json-encoded content of the response (or raises an error).

In some cases it is a completely legitimate response. You made a valid request but there's no data for what you requested. We do not expect an empty response. The content of the response should either not be None or we should get an error (and catch said error with resp.raise_for_status() on line 35). From what's been gathered it looks like this is happening when something is wrong with the hub. It's honestly difficult to replicate. Catching it and logging the full response will hopefully be helpful too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh this makes sense. I might add a comment here about having no json encoded data and that being unexpected.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Added.

Copy link
Contributor

@gshank gshank left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very clean, and nicely commented.

Copy link
Contributor

@nathaniel-may nathaniel-may left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work!

@emmyoop emmyoop merged commit d9cfeb1 into main Jan 31, 2022
@emmyoop emmyoop deleted the er/ct-51-handle-bad-tar branch January 31, 2022 16:26
@github-actions
Copy link
Contributor

The backport to 1.0.latest failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-1.0.latest 1.0.latest
# Navigate to the new working tree
cd .worktrees/backport-1.0.latest
# Create a new branch
git switch --create backport-4609-to-1.0.latest
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick --mainline 1 d9cfeb1ea33cf55c5b96ef1f729eab637e548cb9
# Push it to GitHub
git push --set-upstream origin backport-4609-to-1.0.latest
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-1.0.latest

Then, create a pull request where the base branch is 1.0.latest and the compare/head branch is backport-4609-to-1.0.latest.

emmyoop added a commit that referenced this pull request Jan 31, 2022
* add retry logic, tests when extracting tarfile fails

* fixed bug with not catching empty responses

* specify compression type

* WIP test

* more testing work

* fixed up unit test

* add changelog

* Add more comments!

* clarify why we do the json() check for None
# Conflicts:
#	CHANGELOG.md
emmyoop added a commit that referenced this pull request Jan 31, 2022
* add retry logic, tests when extracting tarfile fails

* fixed bug with not catching empty responses

* specify compression type

* WIP test

* more testing work

* fixed up unit test

* add changelog

* Add more comments!

* clarify why we do the json() check for None
# Conflicts:
#	CHANGELOG.md
iknox-fa pushed a commit that referenced this pull request Feb 8, 2022
* add retry logic, tests when extracting tarfile fails

* fixed bug with not catching empty responses

* specify compression type

* WIP test

* more testing work

* fixed up unit test

* add changelog

* Add more comments!

* clarify why we do the json() check for None

automatic commit by git-black, original commits:
  d9cfeb1
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 1.0.latest Tag for PR to be backported to the 1.0.latest branch cla:yes
Projects
None yet
3 participants