
Retry import to the docker hub on 401 #8558

Merged
merged 3 commits into openshift:master on Apr 20, 2016

Conversation

smarterclayton
Contributor

The DockerHub apparently can race / flake and return 401 on some responses
when retrieving public images (approx 1 in 200 imports?). Perform a simple
retry on 401 from the DockerHub, up to a few times.

Still needs to be made more specific to DockerHub, and also needs a selective fix for the legacy pkg/dockerregistry client.

Fixes #7429 #8399 #8444
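
For context on what the retry keys off of: with the docker/distribution client, an unauthorized response surfaces as an errcode error carrying ErrorCodeUnauthorized, and the shouldRetry added by this PR checks for exactly that (via its own isDockerError helper). Below is a minimal, hypothetical sketch of that kind of check; the package clause and the isUnauthorized name are illustrative assumptions, not the PR's code.

package importer

import (
	"github.com/docker/distribution/registry/api/errcode"
)

// isUnauthorized reports whether err is a registry API error (or a list of
// them) carrying the UNAUTHORIZED error code; only such errors are worth
// retrying here.
func isUnauthorized(err error) bool {
	switch t := err.(type) {
	case errcode.Errors:
		// The registry client can return several errors at once; any
		// unauthorized entry makes the whole response retryable.
		for _, e := range t {
			if isUnauthorized(e) {
				return true
			}
		}
	case errcode.Error:
		return t.Code == errcode.ErrorCodeUnauthorized
	}
	return false
}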

@smarterclayton
Contributor Author

Not complete yet, but mostly complete [test]

}
}

// Get retrieves the identified by the digest, if it exists.

s/identified/Manifest \0/

@soltysh
Contributor

soltysh commented Apr 19, 2016

Mostly nits, otherwise it LGTM.

I hate that we're adding yet more docker-specific logic inside the code.

@smarterclayton
Contributor Author

This is reworked from before: we retry immediately at first, then afterwards keep waiting and retrying up to either a max time or a max number of retries (2). The comments also reflect that this is about the DockerHub JWT token being issued in the same second as the request and thus not yet valid (they don't backdate JWT tokens).

DockerHub generates JWT tokens that are valid from "now", which means
fast actions can get a 401 unauthorized if the second boundary on the
hub servers is not aligned. We retry 401 unauthorized requests
immediately (one time), then wait a small window of time before
retrying. If our operation is too slow (longer than the window) we do
not retry.

Probably an issue with Docker Trusted Registries as well.
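
To make the timing concrete, here is a simplified sketch of the behaviour described above: retry a 401 immediately the first time, and after that only while the operation is still inside a short window (the real change also waits briefly before those later attempts). The retrySketch type and its fields are illustrative assumptions, not the PR's retryRepository, and it reuses the hypothetical isUnauthorized helper sketched under the PR description.

package importer

import "time"

// retrySketch is a simplified stand-in for the PR's retryRepository.
type retrySketch struct {
	retries int           // remaining retries (the PR allows 2)
	initial *time.Time    // when the first 401 was seen; nil until then
	limit   time.Duration // how long after the first 401 a retry is still allowed
}

// shouldRetry reports whether a failed call should be attempted again.
func (r *retrySketch) shouldRetry(err error) bool {
	if err == nil || !isUnauthorized(err) || r.retries <= 0 {
		// Only 401s are retried, and only while retries remain.
		return false
	}
	now := time.Now()
	if r.initial == nil {
		// First 401: remember when it happened and retry right away,
		// since the token-validity skew is on the order of a second.
		r.initial = &now
	} else if now.Sub(*r.initial) > r.limit {
		// The operation itself is slower than the skew window; a stale
		// token is no longer the likely cause, so give up.
		return false
	}
	r.retries--
	return true
}
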
@liggitt
Contributor

liggitt commented Apr 19, 2016

given that the root issue is a dockerhub flake, I think this should be post-1.2

@liggitt
Contributor

liggitt commented Apr 19, 2016

reported flake at docker/hub-feedback#647 with reproducible testcase at https://gist.github.com/liggitt/e7d06dc1cbc9d44034f762c5654c6fff

@miminar

miminar commented Apr 19, 2016

@liggitt rules!

@smarterclayton
Contributor Author

Hrm, pull through still failed - because I don't have the retry logic in the registry.

We want to run at least one time at the full interval mark, which we
were not doing before.

// nil error doesn't consume retries
if r.shouldRetry(nil) {
t.Fatal(r)
Contributor

It's still unclear to me why you're doing Fatalf here instead of just Errorf?

Contributor

The same applies to the next test case as well.

Contributor Author

Because there's no reason to continue the test if the condition is violated


Contributor

Makes sense, but I personally prefer to run as many tests as possible. One violated condition doesn't mean all the others will be incorrect.

Contributor Author

In this case it does - if you don't handle nil errors correctly, the rest of the logic won't work.


Contributor

I see what you're talking about: the check after the initial failure doesn't make sense - agreed. But the additional shouldRetry checks still do. I'd generally convert all of these into table tests, skipping the additional checks if shouldRetry returned the wrong output, but at least going through all the possibilities. That was my intention in raising this :)
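
As a sketch of the table-driven shape being suggested here (hypothetical, written against the retrySketch type from the earlier sketch rather than the PR's retryRepository, and using Errorf so later cases keep running):

package importer

import (
	"testing"
	"time"

	"github.com/docker/distribution/registry/api/errcode"
)

func TestShouldRetrySketch(t *testing.T) {
	unauthorized := errcode.ErrorCodeUnauthorized.WithDetail(nil)
	longAgo := time.Now().Add(-time.Minute)

	testCases := []struct {
		name    string
		err     error
		retries int
		initial *time.Time // non-nil simulates a first retry far in the past
		want    bool
	}{
		{name: "nil error never consumes retries", err: nil, retries: 1, want: false},
		{name: "unauthorized retries while retries remain", err: unauthorized, retries: 1, want: true},
		{name: "no retries left", err: unauthorized, retries: 0, want: false},
		{name: "outside the window", err: unauthorized, retries: 1, initial: &longAgo, want: false},
	}
	for _, tc := range testCases {
		r := &retrySketch{retries: tc.retries, initial: tc.initial, limit: time.Second}
		if got := r.shouldRetry(tc.err); got != tc.want {
			// Errorf lets the remaining cases run even after a failure.
			t.Errorf("%s: shouldRetry = %v, want %v", tc.name, got, tc.want)
		}
	}
}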

@smarterclayton
Contributor Author

Excellent - I can now consistently recreate the pull through flake, even
though this in theory fixes it.

On Tue, Apr 19, 2016 at 4:50 PM, OpenShift Bot notifications@github.com wrote:

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/3107/)

@smarterclayton
Contributor Author

So https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_integration/187/artifact/origin/artifacts/test-end-to-end-docker/logs/container-docker-registry-1-sirfi-registry.log just failed for pull through, with this error:

172.18.11.255 - - [19/Apr/2016:20:45:12 +0000] "GET /v2/cache/mysql/blobs/sha256:efd26ecc95486998b41b3fe167236e3fb3e109c66dd1a51ab5161e40b06cf486 HTTP/1.1" 200 51342828 "" "docker/1.9.1 go/go1.4.2 kernel/3.10.0-229.7.2.el7.x86_64 os/linux arch/amd64"
172.18.11.255 - - [19/Apr/2016:20:45:18 +0000] "GET /v1/repositories/cache/mysql/images HTTP/1.1" 404 19 "" "docker/1.9.1 go/go1.4.2 kernel/3.10.0-229.7.2.el7.x86_64 os/linux arch/amd64"

@soltysh could this be your context bug thing? I'm recreating this consistently - are we not writing the response correctly?

@smarterclayton
Contributor Author

So the problem now with pull through is not an import failure - it's some other condition

@openshift-bot
Contributor

Evaluated for origin test up to 9dd0f31

@openshift-bot
Contributor

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/3113/)

@smarterclayton
Contributor Author

smarterclayton commented Apr 20, 2016 via email

@openshift-bot
Contributor

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/5639/) (Image: devenv-rhel7_4001)

@openshift-bot
Contributor

Evaluated for origin merge up to 9dd0f31

@openshift-bot openshift-bot merged commit 1a02064 into openshift:master Apr 20, 2016
@smarterclayton
Contributor Author

Crap, did not mean to merge this.

return false
}

if r.retries <= 0 {
Contributor

don't know the calling code well enough to know if we need to take a lock here to avoid racing between goroutines

Contributor Author

No concurrent calls (underlying docker client is not thread safe)
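
Purely to illustrate the question above (and not needed here, per the reply that there are no concurrent calls), this hypothetical sketch shows the part that would need a lock if the wrapper were shared between goroutines: the check-and-decrement of the retry counter.

package importer

import "sync"

// lockedRetries is a hypothetical guard around a retry counter; per the reply
// above, the PR does not need it because the wrapped docker client is never
// called concurrently.
type lockedRetries struct {
	mu      sync.Mutex
	retries int
}

// consume atomically checks whether a retry is available and takes one.
func (l *lockedRetries) consume() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.retries <= 0 {
		return false
	}
	l.retries--
	return true
}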


@soltysh
Contributor

soltysh commented Apr 20, 2016

@soltysh could this be your context bug thing? I'm recreating this consistently - are we not writing the response correctly?

I've identified the following problems here, although it looks like the later ones are caused by the first:

  1. 401 Unauthorized:
time="2016-04-19T20:45:11.048087165Z" level=error msg="Failed copying content from remote store \"sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4\": unauthorized: authentication required"...
  2. Content-Length problem:
time="2016-04-19T20:45:11.048213314Z" level=error msg="error serving error json: Conn.Write wrote more than the declared Content-Length (from unknown: unknown error)"

So the problem now with pull through is not an import failure - it's some other condition

The auth problem bites us in 2 separate places, as I described in #8399 (comment):

  1. at import time
  2. at pull time

With this PR you've addressed import, but the pullthrough logic is still affected.

@soltysh
Contributor

soltysh commented Apr 20, 2016

@smarterclayton we'd have to add the retry logic in pullthroughblobstore.go's Stat and ServeBlob methods as well, agreed?

@smarterclayton
Contributor Author

We need to add the logic to the blob store wrapper in the retry code. Opened #8574.
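
For illustration, this is roughly the shape such a retrying blob store wrapper could take. The interface below is a hypothetical, pared-down stand-in for the two methods named above; it is not the real docker/distribution blob store API and not the code ultimately proposed in #8574.

package server

import "errors"

// errUnauthorized stands in for a 401 from the upstream registry in this sketch.
var errUnauthorized = errors.New("unauthorized")

// blobStatServer is a hypothetical, pared-down stand-in for the two methods
// named in the review (Stat and ServeBlob on the pullthrough blob store).
type blobStatServer interface {
	Stat(dgst string) (int64, error)
	ServeBlob(dgst string) error
}

// retryingBlobStore wraps a blobStatServer and retries when the upstream
// returns unauthorized, in the same spirit as the repository-level retry.
type retryingBlobStore struct {
	delegate blobStatServer
	retries  int
}

// Stat retries the delegate's Stat while unauthorized errors and retries remain.
func (r *retryingBlobStore) Stat(dgst string) (int64, error) {
	for {
		size, err := r.delegate.Stat(dgst)
		if err != errUnauthorized || r.retries <= 0 {
			return size, err
		}
		r.retries-- // consume a retry and call the delegate again
	}
}

// ServeBlob follows the same pattern as Stat.
func (r *retryingBlobStore) ServeBlob(dgst string) error {
	for {
		err := r.delegate.ServeBlob(dgst)
		if err != errUnauthorized || r.retries <= 0 {
			return err
		}
		r.retries--
	}
}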

Successfully merging this pull request may close these issues: test-cmd new-app image stream import flake.