careful selection of hitpass vs hitmiss #2865

Closed
dridi opened this issue Dec 11, 2018 · 8 comments

dridi commented Dec 11, 2018

Loosely related to #2864.

In all cases, I'm talking about vcl_backend_response here.

With the (now somewhat old) switch from hitpass to hitmiss when beresp.uncacheable is set, and the reintroduction of hitpass via return(pass), we have witnessed an increase in the number of objects. It is of course a known behavior, but we think we could do better.

First, there's a significant difference between return(pass) and return(pass(DURATION)). The former keeps the current state of beresp.{ttl,grace,keep} and can lead to #2864, while the latter uses the DURATION as the TTL and disables grace and keep altogether.

Another difference is that specifying a duration for a hitpass produces an HFP TTL log record, but a plain return(pass) doesn't. I think it should, and in both cases we probably don't want grace or keep.
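
For illustration, a hedged sketch of the two forms; the X-Example-* trigger headers are hypothetical and 120s is an arbitrary duration:

sub vcl_backend_response {
    if (beresp.http.X-Example-Plain-Pass) {
        # Plain hitpass: keeps whatever beresp.{ttl,grace,keep}
        # currently hold as the hitpass lifetime, and today produces
        # no HFP TTL log record.
        return (pass);
    }
    if (beresp.http.X-Example-Timed-Pass) {
        # Timed hitpass: 120s becomes the hitpass TTL, and grace and
        # keep are disabled.
        return (pass(120s));
    }
    return (deliver);
}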

Below is the current built-in VCL:

sub vcl_backend_response {
    if (bereq.uncacheable) {
        return (deliver);
    } else if (beresp.ttl <= 0s ||
      beresp.http.Set-Cookie ||
      beresp.http.Surrogate-control ~ "no-store" ||
      (!beresp.http.Surrogate-Control &&
        beresp.http.Cache-Control ~ "no-cache|no-store|private") ||
      beresp.http.Vary == "*") {
        # Mark as "Hit-For-Miss" for the next 2 minutes
        set beresp.ttl = 120s;
        set beresp.uncacheable = true;
    }
    return (deliver);
}

Using hitmiss for all cases leads to an increased number of objects during normal operations. The body of such an object is properly cleaned up once delivered to the client (like a pass or a hitpass), but that's still a problem for backend-trusting setups: hitmiss objects pile up.

The suggestion is to only use hitmiss for its purpose: enabling opportunistic fetches on the assumption that the current transaction is not cacheable but similar transactions should be. In most cases, HTTP-compliant backends are telling us that an object should not be cacheable. In fact, the only case that should lead to a hitmiss is when a client shows up without a cookie and the backend assigns one while serving a cacheable resource.

Suggested built-in:

sub vcl_backend_response {
    if (bereq.uncacheable) {
        return (deliver);
    }

    if (beresp.ttl <= 0s ||
      beresp.http.Surrogate-control ~ "no-store" ||
      beresp.http.Cache-Control ~ "no-cache|no-store" ||
      beresp.http.Vary == "*") {
        # Mark as "Hit-For-Pass" for the next 2 minutes
        return (pass(120s));
    }

    if (beresp.http.Set-Cookie) {
        # Mark as "Hit-For-Miss" for the next 2 minutes
        set beresp.ttl = 120s;
        set beresp.uncacheable = true;
        set beresp.storage = storage.Transient;
        return (deliver);
    }

    if (beresp.http.Cache-Control ~ "private") {
        # Mark as "Hit-For-Pass" for the next 2 minutes
        return (pass(120s));
    }

    return (deliver);
}

You may notice that I also changed the Surrogate-Control logic: even if we, as an edge server, aren't told not to store, we should still look at the Cache-Control header to figure out whether the response is cacheable.

A hitpass also leaves the opportunity to later fetch something cacheable, but that's only true once the underlying resource changes altogether. A well-behaving backend, when setting a cookie on a cacheable response, should say that the response is private (cacheable only by the original client); only then, and in the absence of no-cache or no-store directives, can we assume a response is hitmiss material.
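
To make that concrete, here is a hypothetical condition capturing the "hitmiss material" case just described (the 120s window mirrors the built-in):

sub vcl_backend_response {
    # Hypothetical: Set-Cookie on an otherwise cacheable response that
    # the backend marks private, without no-cache or no-store.
    if (beresp.http.Set-Cookie &&
      beresp.http.Cache-Control ~ "private" &&
      beresp.http.Cache-Control !~ "no-cache|no-store") {
        set beresp.ttl = 120s;
        set beresp.uncacheable = true;
    }
    return (deliver);
}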

The current all-hitmiss behavior is a problem for setups with well-behaving backends that properly drive the cache policy directly from response headers.

dridi commented Dec 11, 2018

I edited the VCL to add a missing return(deliver), sorry.

dridi commented Dec 11, 2018

Another point I forgot to convey is that we ran into situations where we don't want a hitmiss to land in the regular storage, hence sending it to Transient in the VCL above. Instead of doing this in core code, it could be part of the built-in VCL.

I will edit the description accordingly.

Reading the Edge Architecture Specification, I'm also wondering whether we should support it in the built-in VCL too. It can be used to drive a TTL+grace for Varnish that is different from the client's TTL, or to convey ESI support. And since we now version the VCL syntax, another open question for such a change is whether to version the built-in too.
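
As a strawman, a minimal sketch of honoring a surrogate-targeted max-age, assuming vmod_std for the duration conversion (an illustration, not a spec-complete implementation):

import std;

sub vcl_backend_response {
    # Hypothetical: let Surrogate-Control's max-age drive the TTL on
    # the surrogate, independently of the client-facing Cache-Control.
    if (beresp.http.Surrogate-Control ~ "max-age=[0-9]+") {
        set beresp.ttl = std.duration(
            regsub(beresp.http.Surrogate-Control,
              ".*max-age=([0-9]+).*", "\1s"),
            120s);
    }
}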

@karptonite

This is a bit off topic, but your change to the Surrogate-Control logic also solves another issue. As you say, the presence of a Surrogate-Control header shouldn't invalidate Cache-Control: no-cache|no-store; I'm using Surrogate-Control: content="ESI/1.0" to indicate that ESI should be processed, and that header was preventing the Cache-Control header from doing its thing before I added some custom VCL.
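
For reference, the custom VCL amounts to something like this sketch (the exact header matching is an assumption on my part):

sub vcl_backend_response {
    # Use Surrogate-Control only to switch on ESI processing, leaving
    # cacheability decisions to Cache-Control.
    if (beresp.http.Surrogate-Control ~ "content=.*ESI/1.0") {
        set beresp.do_esi = true;
    }
}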

@slimhazard

+-0 due to mixed experience with hitpass and hitmiss. Really unsure about the right solution.

I too have been alarmed at the large number of objects created by hitmiss. In a production project we tried hitmiss when it became possible, but went back to hitpass due to the memory consumption, especially since the memory usage seemed to keep increasing, so that it wasn't obvious whether it could be kept within the bounds set for a -s allocator. In essence, we used custom VCL to go back to something similar to what the patch proposes (which can be taken as evidence that @dridi has the right idea).
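
Roughly, a hedged approximation of that custom VCL (production details differ):

sub vcl_backend_response {
    # Turn the built-in hitmiss cases back into a timed hitpass so
    # that no full objects are inserted for uncacheable content.
    if (beresp.ttl <= 0s || beresp.http.Set-Cookie) {
        return (pass(120s));
    }
    return (deliver);
}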

OTOH, hitmiss was introduced because hitpass was notoriously hard to understand: many users ran into problems with objects that became uncacheable, and there was nothing they could do about it until the hitpass TTL expired. To many users it seemed to be an inexplicable bug in Varnish -- maybe they got an explanation on varnish-misc or #varnish, but there's no telling how many came away thinking that Varnish is broken.

The patch returns hitpass with a 2-minute TTL for most cases -- it becomes the rule rather than the exception. So it would bring us back to trying to explain why objects become uncacheable for two minutes, no matter what Cache-Control says.

I guess it has a lot to do with whether @dridi is right about this:

In fact, the only case that should lead to a hitmiss is when a client shows up without a cookie and the backend assigns one while serving a cacheable resource.

Or maybe there's a way to reduce the number of objects created by hitmiss?

BTW, the use of Transient sort of works around the problem of memory consumption, because Transient is unbounded by default. But in a setup where you use -s Transient to set bounds, so as to prevent running out of memory if Transient gets excessively large, you can still hit the limits.
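
For instance, something along these lines bounds Transient, and is then exactly where the limits can be hit:

# Hypothetical example: cap Transient at 256 MB instead of the
# unbounded default.
varnishd -s Transient=malloc,256m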

bsdphk commented Jan 21, 2019

We clearly need to think more about this. Will bring it back at a near-future bugwash.

nigoroll commented Feb 4, 2019

notes on bugwash:

  • I fail to see the issue this ticket hinges on, explained by @dridi as a high number of HFM objects when hitmiss objects are inserted concurrently. We should understand if/how this is different from the high number of objects with hit-for-miss reported in #2754.
  • We should change hfm and hfp such that grace and keep are always zero (see the sketch after this list).
  • We should use a low hfm ttl in builtin.vcl and our examples, because a long ttl has only minor benefits (basically, the hfm ttl is just the interval for which we prevent request coalescing).
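
A sketch of the second point, assuming it stays in VCL rather than moving into core code (the 120s value is arbitrary):

sub vcl_backend_response {
    if (beresp.ttl <= 0s) {
        # Short hitmiss window with grace and keep explicitly zeroed,
        # so nothing outlives the hitmiss decision.
        set beresp.ttl = 120s;
        set beresp.grace = 0s;
        set beresp.keep = 0s;
        set beresp.uncacheable = true;
    }
    return (deliver);
}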

bsdphk commented Feb 4, 2019

@dridi moves this to VIP (with #2864) until actionable consensus.

bsdphk closed this on Mar 4, 2019
