Retry after failure to download or failure to open files #4609

emmyoop · 2022-01-21T20:57:24Z

Description

Add retry logic when there's a problem extracting the tarball.
Fix retry logic when a response is NoneType.
Add relevant unit tests.

Checklist

I have signed the CLA
I have run this code in development and it appears to resolve the stated issue
This PR includes tests, or tests are not required/relevant for this PR
I have updated the CHANGELOG.md and added information about my change

nathaniel-may

I know this is still in draft but I wanted to address your questions:

The part that's hard to test is the download. I'd separate that out so that you can test unpacking the tarball with an integration test that commits a tarball as part of the test. You can do one well-formed tarball and one mal-formed one to test both sides.

As for a debug log message, I might consider making this a longer one. Listing that this might be an issue with github and to check their status page, if it's a new release to consider logging a ticket with the package maintainers, and maybe even include steps or a link to steps that show how to manually install the dependency and to log a bug with the core team if a manual install worked.

nathaniel-may · 2022-01-24T15:44:03Z

core/dbt/deps/registry.py

+            deps_path,
+            package_name
+        )
+        connection_exception_retry(download_untar_fn, 5)


It might help with testing and minimizing retry logic to separate downloading and unpacking the tarball into two distinct steps. If the error is with unpacking, redoing the download would be unnecessary.

@nathaniel-may we do possibly need to retry the download when unpacking the tarball fails. We have data from the last github incident where only about half of the runs failed during the incident implying that some jobs were able to run dbt deps successfully. It's impossible to know why but seems worth it to retry downloading the file in case something went wrong there.

That said, pulling them apart will make it easier to test so I'll incorporate both for the purpose of testing.

emmyoop · 2022-01-27T21:55:18Z

core/dbt/clients/system.py

@@ -485,7 +485,7 @@ def untar_package(
 ) -> None:
    tar_path = convert_path(tar_path)
    tar_dir_name = None
-    with tarfile.open(tar_path, 'r') as tarball:
+    with tarfile.open(tar_path, 'r:gz') as tarball:


The exception we were raising was in a section where the code was figuring out the compression type. Since we know it's gz, set it and hopefully bubble up a better exception.

nathaniel-may

This is great! I left some minor suggestions for improving the test readability, and before I approve I want to understand that one last line. But nothing I've said in this review should block a merge.

nathaniel-may · 2022-01-28T16:17:36Z

test/unit/test_system_client.py

+            pass
+
+    def test_untar_package_success(self):
+        with NamedTemporaryFile(


Optional suggestion for readability, if the tarball creations for these tests were in separate methods, the tests themselves could read more cleanly.

They're all actually different so I'll leave them in the test but comment what's happening for clarity.

nathaniel-may · 2022-01-28T16:18:18Z

test/unit/test_system_client.py

+                    tar.addfile(tarfile.TarInfo(relative_file_a), open(file_a.name))
+
+        assert tarfile.is_tarfile(tar.name)
+        dbt.clients.system.untar_package(tar_file_full_path, self.tempdest)


I would at least comment here that THIS is the thing we're testing so that future failures are easier to debug.

nathaniel-may · 2022-01-28T16:19:05Z

test/unit/test_system_client.py

+        with self.assertRaises(tarfile.ReadError) as exc:
+            dbt.clients.system.untar_package(tar_file_path, self.tempdest)
+
+    def test_untar_package_empty(self):


Let's comment all the tests with the situation they're testing, and the expected results that we're asserting.

nathaniel-may · 2022-01-28T16:20:26Z

test/unit/test_system_client.py

@@ -185,3 +188,55 @@ def tearDown(self):
            shutil.rmtree(self.base_dir)
        except:
            pass
+
+
+class TestUntarPackage(unittest.TestCase):


Love the strategies you've employed here. Making the tarballs in the tests is much cleaner than my previous suggestion of committing tarball files.

nathaniel-may · 2022-01-28T16:22:10Z

test/unit/test_core_dbt_utils.py

    def test_connection_exception_retry_success_none_response(self):
        Counter._reset()
        connection_exception_retry(lambda: Counter._add_with_none_exception(), 5)
        self.assertEqual(2, counter) # 2 = original attempt returned None, plus 1 retry

+    def test_connection_exception_retry_success_failed_untar(self):


Love a good unit test.

nathaniel-may · 2022-01-28T16:23:23Z

core/dbt/utils.py

@@ -600,14 +601,15 @@ def __contains__(self, name) -> bool:

 def _connection_exception_retry(fn, max_attempts: int, attempt: int = 0):
    """Attempts to run a function that makes an external call, if the call fails
-    on a connection error or timeout, it will be tried up to 5 more times.
+    on a connection error, timeout or other exception, it will be tried up to 5 more times.


are "other exceptions" all related to decoding/reading? should this comment be more specific? It might be able to provide historical context by including a ticket number.

nathaniel-may · 2022-01-28T16:23:45Z

core/dbt/exceptions.py

@@ -774,6 +774,10 @@ def system_error(operation_name):


 class ConnectionException(Exception):
+    """


nathaniel-may · 2022-01-28T16:24:24Z

core/dbt/deps/registry.py

+        """
+        Sometimes the download of the files fails and we want to retry.  Sometimes the
+        download appears successful but the file did not make it through as expected
+        (generally due to a github incident).  Either way we want to retry downloading
+        and untarring to see if we can get a success.  Call this within
+        `_connection_exception_retry`
+        """


nathaniel-may · 2022-01-28T16:25:56Z

core/dbt/clients/registry.py

@@ -33,7 +33,7 @@ def _get(path, registry_base_url=None):
    resp = requests.get(url, timeout=30)
    fire_event(RegistryProgressGETResponse(url=url, resp_code=resp.status_code))
    resp.raise_for_status()
-    if resp is None:
+    if resp.json() is None:


I think this is the only line I don't totally understand. Can you describe what's going on here?

You can get a response that is not None but response.json() is None. This was originally reported as needed to be retried (why raising ContentDecodingError was added) by #4601. #4577 was reported because the problem was never actually resolved because we were never falling into this if statement. This change will force the retires as expected.

gotcha. So what does this json() method do? what does it signify when it returns None

resp is a response object (attributes of content, headers, status_code, etc) and the json() method returns the json-encoded content of the response (or raises an error).

In some cases it is a completely legitimate response. You made a valid request but there's no data for what you requested. We do not expect an empty response. The content of the response should either not be None or we should get an error (and catch said error with resp.raise_for_status() on line 35). From what's been gathered it looks like this is happening when something is wrong with the hub. It's honestly difficult to replicate. Catching it and logging the full response will hopefully be helpful too.

ahh this makes sense. I might add a comment here about having no json encoded data and that being unexpected.

Good point. Added.

gshank

Looks very clean, and nicely commented.

nathaniel-may

Great work!

github-actions · 2022-01-31T16:27:10Z

The backport to 1.0.latest failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-1.0.latest 1.0.latest
# Navigate to the new working tree
cd .worktrees/backport-1.0.latest
# Create a new branch
git switch --create backport-4609-to-1.0.latest
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick --mainline 1 d9cfeb1ea33cf55c5b96ef1f729eab637e548cb9
# Push it to GitHub
git push --set-upstream origin backport-4609-to-1.0.latest
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-1.0.latest

Then, create a pull request where the base branch is 1.0.latest and the compare/head branch is backport-4609-to-1.0.latest.

* add retry logic, tests when extracting tarfile fails * fixed bug with not catching empty responses * specify compression type * WIP test * more testing work * fixed up unit test * add changelog * Add more comments! * clarify why we do the json() check for None # Conflicts: # CHANGELOG.md

* add retry logic, tests when extracting tarfile fails * fixed bug with not catching empty responses * specify compression type * WIP test * more testing work * fixed up unit test * add changelog * Add more comments! * clarify why we do the json() check for None automatic commit by git-black, original commits: d9cfeb1

cla-bot bot added the cla:yes label Jan 21, 2022

emmyoop changed the title ~~Er/ct 51 handle bad tar~~ Retry after failure to download or failure to open files Jan 21, 2022

emmyoop requested review from nathaniel-may and gshank January 21, 2022 21:15

nathaniel-may reviewed Jan 24, 2022

View reviewed changes

emmyoop added the backport 1.0.latest Tag for PR to be backported to the 1.0.latest branch label Jan 26, 2022

emmyoop linked an issue Jan 27, 2022 that may be closed by this pull request

[CT-49] [Bug] dbt deps does not handle/retry on a malformed response from hub #4577

Closed

1 task

emmyoop force-pushed the er/ct-51-handle-bad-tar branch from 040f7e2 to bfe7b03 Compare January 27, 2022 20:49

emmyoop marked this pull request as ready for review January 27, 2022 21:34

emmyoop requested review from a team as code owners January 27, 2022 21:34

emmyoop commented Jan 27, 2022

View reviewed changes

emmyoop requested review from nathaniel-may and removed request for a team January 28, 2022 16:13

nathaniel-may reviewed Jan 28, 2022

View reviewed changes

emmyoop force-pushed the er/ct-51-handle-bad-tar branch from bfe7b03 to e9ff1cb Compare January 28, 2022 20:53

emmyoop requested a review from nathaniel-may January 28, 2022 21:56

gshank approved these changes Jan 28, 2022

View reviewed changes

nathaniel-may approved these changes Jan 31, 2022

View reviewed changes

emmyoop added 9 commits January 31, 2022 09:48

add retry logic, tests when extracting tarfile fails

3d0b812

fixed bug with not catching empty responses

2dfecbe

specify compression type

faf932c

WIP test

fee1c4b

more testing work

107bc4e

fixed up unit test

16d93ef

add changelog

568e4f0

Add more comments!

7fee17a

clarify why we do the json() check for None

b80a6cc

emmyoop force-pushed the er/ct-51-handle-bad-tar branch from 7f0f48b to b80a6cc Compare January 31, 2022 15:49

emmyoop merged commit d9cfeb1 into main Jan 31, 2022

emmyoop deleted the er/ct-51-handle-bad-tar branch January 31, 2022 16:26

emmyoop mentioned this pull request Jan 31, 2022

[BACKPORT] Retry after failure to download or failure to open files (#4609) #4649

Merged

aranke mentioned this pull request May 20, 2024

[Backport 1.0.latest] Fix #9907: Add retry to tox to reduce flaky tests due to network failures #10178

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Retry after failure to download or failure to open files #4609

Retry after failure to download or failure to open files #4609

emmyoop commented Jan 21, 2022 •

edited

Loading

nathaniel-may left a comment •

edited

Loading

nathaniel-may Jan 24, 2022

emmyoop Jan 24, 2022

emmyoop Jan 27, 2022

nathaniel-may left a comment

nathaniel-may Jan 28, 2022

emmyoop Jan 28, 2022

nathaniel-may Jan 28, 2022

nathaniel-may Jan 28, 2022

nathaniel-may Jan 28, 2022

nathaniel-may Jan 28, 2022

nathaniel-may Jan 28, 2022

nathaniel-may Jan 28, 2022

nathaniel-may Jan 28, 2022

nathaniel-may Jan 28, 2022

emmyoop Jan 28, 2022

nathaniel-may Jan 28, 2022

emmyoop Jan 28, 2022 •

edited

Loading

nathaniel-may Jan 28, 2022

emmyoop Jan 28, 2022

gshank left a comment

nathaniel-may left a comment

github-actions bot commented Jan 31, 2022

		@@ -774,6 +774,10 @@ def system_error(operation_name):


		class ConnectionException(Exception):
		"""

Retry after failure to download or failure to open files #4609

Retry after failure to download or failure to open files #4609

Conversation

emmyoop commented Jan 21, 2022 • edited Loading

Description

Checklist

nathaniel-may left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nathaniel-may left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

emmyoop Jan 28, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gshank left a comment

Choose a reason for hiding this comment

nathaniel-may left a comment

Choose a reason for hiding this comment

github-actions bot commented Jan 31, 2022

emmyoop commented Jan 21, 2022 •

edited

Loading

nathaniel-may left a comment •

edited

Loading

emmyoop Jan 28, 2022 •

edited

Loading