This repository has been archived by the owner on Aug 3, 2023. It is now read-only.

Expiration changes hanging deploys #2224

Closed
moonwell-octavius opened this issue Mar 18, 2022 · 23 comments · Fixed by #2226
Labels
bug Something isn't working

Comments

@moonwell-octavius

🐛 Bug report

Describe the bug

Hi there, we use wrangler via https://github.com/cloudflare/wrangler-action to deploy a workers-site, and our builds stopped working within the past few hours.

Diving into why, I noticed that wrangler 1.19.9 was cut just a few hours ago.

Built successfully, built project size is 13 KiB.

... snip ...

 Deleting stale files...
Error:  Code 10033: bulk put: 'Invalid expiration of 1647570824. Please specify integer greater than the current number of seconds since the UNIX epoch.'

This makes me think that the recent change is related. When I downgrade by setting wranglerVersion: '1.19.8' in my GitHub Actions config, things work as expected.

Reproduce the bug

I suspect that if you deploy a site with other assets already uploaded it will choke; I'm not sure we do anything special that would trigger this behavior.

Expected behavior

I expect the API not to throw an error when running wrangler to deploy my site.


@moonwell-octavius moonwell-octavius added the bug Something isn't working label Mar 18, 2022
@fullheart

We have the same problem with version 1.19.9. Thank you for fixing it.

@arthurhamon

We encountered the same problem; we reverted to the previous version, 1.19.8.

@jplhomer
Contributor

Investigating a bit, we discovered that `chrono::Utc::now().timestamp() + (60 * 5)`, when running in GitHub Actions containers, tends to produce a timestamp that is about 10 minutes behind the actual time. This leads to the bulk put KV API error.

The answer might be to use `expiration_ttl` instead, but it sounds like that also has implications: cloudflare/workers-sdk#587 (review)

Another way to improve this would be to catch the error and stop the GitHub Action to prevent it from hanging.
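
To illustrate the difference, here is a minimal Rust sketch (not the actual wrangler source; the key name and the five-minute window are placeholders) of the two ways a KV bulk-write entry can express expiry. The absolute `expiration` timestamp inherits whatever clock skew the deploying machine has, while `expiration_ttl` is relative and lets the API compute the deadline against its own clock:

```rust
use chrono::Utc;
use serde_json::json;

fn main() {
    let five_minutes: i64 = 60 * 5;

    // Clock-sensitive: if the container's clock is ten minutes behind, this
    // value can already be in the past when the API receives it, producing
    // "Invalid expiration of ...".
    let with_expiration = json!([{
        "key": "assets/old-file.js",
        "value": "",
        "expiration": Utc::now().timestamp() + five_minutes
    }]);

    // Clock-insensitive: the server applies the offset to its own clock.
    let with_ttl = json!([{
        "key": "assets/old-file.js",
        "value": "",
        "expiration_ttl": five_minutes
    }]);

    println!("{with_expiration}\n{with_ttl}");
}
```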

@nsinghbakkt

Same here; we are stuck at the moment, all of our deploys are failing. Please advise.

@fullheart

We worked around the problem by downgrading the package version in our Bitbucket pipeline to 1.19.8:

  • package.json: wrangler: 1.19.8 (pinned version)
  • bitbucket-pipelines.yml: yarn global add wrangler@1.19.8

@threepointone
Contributor

This sucks, sorry about this y'all. We'll have a look on Monday. The current workaround is to use 1.19.8.

threepointone added a commit to cloudflare/workers-sdk that referenced this issue Mar 20, 2022
This switches how we expire static assets with `[site]` uploads to use `expiration_ttl` instead of `expiration`. This is because we can't trust the time that a deploy target may provide (like in cloudflare/wrangler-legacy#2224).
jakubsacha added a commit to analysis-tools-dev/website-old that referenced this issue Mar 20, 2022
Deployments are failing due to: cloudflare/wrangler-legacy#2224
@petebacondarwin

Urgh! So we can't rely on the current time given in a Docker container to be accurate!!
I think using expiration_ttl will solve this.

@caass
Contributor

caass commented Mar 21, 2022

This should be fixed with the latest release, v1.19.10. Can anyone confirm if this works for them or is still broken?

@caass
Contributor

caass commented Mar 21, 2022

Reopening until we have confirmation that this is fixed

@caass caass reopened this Mar 21, 2022
@fwwieffering

fwwieffering commented Mar 21, 2022

@caass I have a worker with ~ 3900 files in the KV store. These built up over the last few days when deleting files failed.

I don't think it's necessarily indicative of the patch not working, but for KV stores of this size, wrangler publish times out.

2022-03-21T17:49:17.8200716Z Using API Token authentication
2022-03-21T17:49:19.8534328Z /usr/local/bin/wrangler -> /usr/local/lib/node_modules/@cloudflare/wrangler/run-wrangler.js
2022-03-21T17:49:19.8534762Z /usr/local/bin/wrangler1 -> /usr/local/lib/node_modules/@cloudflare/wrangler/run-wrangler.js
2022-03-21T17:49:19.8593163Z 
2022-03-21T17:49:19.8593462Z > @cloudflare/wrangler@1.19.10 postinstall /usr/local/lib/node_modules/@cloudflare/wrangler
2022-03-21T17:49:19.8595009Z > node ./install-wrangler.js
2022-03-21T17:49:19.8595192Z 
2022-03-21T17:49:28.8701620Z 
2022-03-21T17:49:28.8704353Z found 1 high severity vulnerability
2022-03-21T17:49:28.8704726Z   run `npm audit fix` to fix them, or `npm audit` for details
2022-03-21T17:49:30.3936383Z  Built successfully, built project size is 12 KiB.
2022-03-21T17:49:32.9037406Z  Using namespace for Workers Site "<kv namespace>"
2022-03-21T17:49:38.8310088Z  Success
2022-03-21T17:49:38.8310297Z  Uploading site files
2022-03-21T17:49:43.7417744Z  Successfully published your script to
2022-03-21T17:49:43.7418254Z  <url>/* => stayed the same
2022-03-21T17:49:43.7418382Z 
2022-03-21T17:49:43.7418453Z  Deleting stale files...
2022-03-21T18:38:53.8751263Z Error: error sending request for url (https://api.cloudflare.com/client/v4/accounts/<account id>/storage/kv/namespaces/<namespace id>/values/assets/index.f268669f.df4cc6c294.js): operation timed out
2022-03-21T18:38:53.8751620Z 
2022-03-21T18:38:53.8751684Z Caused by:
2022-03-21T18:38:53.8751827Z     operation timed out

@caass
Contributor

caass commented Mar 21, 2022

@fwwieffering I see... the likely cause is that index.f268669f.df4cc6c294.js is a large file, and we now download every file in the store and re-upload it with an expirationTtl so that the old files expire a few minutes after the new files are uploaded.

An immediate fix that comes to mind would be to implement automatic HTTP retries for requests that time out, so we get 3 or 5 goes at downloading the files before giving up and throwing an error. I can work on this in wrangler 1 and raise an issue for wrangler 2.

It may be worth simply running wrangler publish again and seeing if you get the same error, since that's essentially what an HTTP retry fix would do.
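
A minimal sketch of that retry idea, assuming a generic fallible operation rather than wrangler's actual HTTP client (the attempt count and back-off values are placeholders):

```rust
use std::{thread, time::Duration};

// Retry a fallible operation a fixed number of times with a simple
// linear back-off; the real fix would live in wrangler's HTTP client.
fn with_retries<T, E>(max_attempts: u32, mut op: impl FnMut() -> Result<T, E>) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) => {
                attempt += 1;
                if attempt >= max_attempts {
                    return Err(err); // out of attempts, surface the error
                }
                thread::sleep(Duration::from_secs(2 * u64::from(attempt)));
            }
        }
    }
}

fn main() {
    // Dummy operation that fails twice before succeeding.
    let mut calls = 0;
    let result: Result<&str, &str> = with_retries(5, || {
        calls += 1;
        if calls < 3 { Err("operation timed out") } else { Ok("downloaded") }
    });
    println!("{result:?} after {calls} calls");
}
```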

@fwwieffering

@fwwieffering I see... the likely cause is that index.f268669f.df4cc6c294.js is a large file, and we now download every file in the store and re-upload it with an expirationTtl so that the old files expire a few minutes after the new files are uploaded.

I was worried more about the number of files than the size of them. Running wrangler publish on 1.19.10 in a different environment with a KV size of about 650 (less deployment churn over the last few days) succeeded.

@caass
Contributor

caass commented Mar 21, 2022

Ah, I see. Did running wrangler publish again give the same error?

@hubertott

@fwwieffering I see... the likely cause is that index.f268669f.df4cc6c294.js is a large file, and we now download every file in the store and re-upload it with an expirationTtl so that the old files expire a few minutes after the new files are uploaded.

I was worried more about the number of files than the size of them. Running wrangler publish on 1.19.10 in a different environment with a KV size of about 650 (less deployment churn over the last few days) succeeded.

I too am concerned about the number of existing files in the KV store; we've been deploying headless applications for over a year now.

I am testing 1.19.10 at the moment and it's still stuck at Deleting stale files... after 30 minutes.

@christofferh

I am testing 1.19.10 at the moment and it's still stuck at Deleting stale files... after 30 minutes.

Seeing the same thing here, but with a KV size of ~12k items.

@petebacondarwin

I believe that the Cloudflare API that Wrangler is using to expire these items is rate-limited to 1200 requests per 5 minutes.
So I would not be surprised if 12K items is a problem, since that is 10x as many requests as we are allowed in 5 minutes.
Thinking about what the solution could be ...
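
As a back-of-the-envelope check, assuming that 1200-requests-per-5-minutes figure and roughly one API request per item, the best-case timings line up with publishes hanging at "Deleting stale files..." for a long time:

```rust
use std::time::Duration;

fn main() {
    let limit_per_window: u32 = 1200;
    let window = Duration::from_secs(5 * 60);
    let min_gap = window / limit_per_window; // 250 ms between requests on average

    // Item counts reported in this thread.
    for items in [650u32, 3_900, 12_000] {
        let best_case = min_gap * items;
        println!(
            "{items} items -> at least {} minutes just to stay under the limit",
            best_case.as_secs() / 60
        );
    }
}
```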

@caass
Contributor

caass commented Mar 22, 2022

TL;DR: This should be fixed with the latest release, v1.19.11. Upgrade and if you're still experiencing problems, please report them here <3.

We've decided to revert the behavior of workers-sites to what it was before -- deleting unused files in the KV namespace when new files are uploaded.

We had complaints where, on a wrangler publish, the site would be temporarily unavailable in the window between when the old files were deleted and when the new files had propagated to the edge. We tried to fix this by marking the old files for expiration instead of deleting them. This had a couple of problems:

  1. There's no way to set the expiration for an existing KV pair, so we had to download and then re-upload every file in the workers-sites KV namespace (see the sketch after this list). This was problematic for users with high churn in their KV namespace: in the five minutes before the files were set to expire, their namespace would balloon with new files, eventually causing API requests to fail as we downloaded more and more assets to mark them for expiration.
  2. Setting the "expiration" field was unreliable: the publish could occur from an environment with an incorrect system time, or the publish could take so long that, by the time the tagged-for-expiration files were uploaded, they were rejected by the API for an expiration set in the past.
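
To make problem 1 concrete, here is a rough sketch of the shape of that pass; `KvClient` and its methods are hypothetical stand-ins, not wrangler's real API client. Every key already in the namespace costs one read plus one write before any new asset is uploaded, so a namespace that keeps growing keeps making the pass longer:

```rust
// Hypothetical sketch of the "mark everything for expiration" pass.
struct KvClient;

impl KvClient {
    fn list_keys(&self) -> Vec<String> {
        // ...one or more list calls against the namespace...
        vec!["assets/old-1.js".into(), "assets/old-2.js".into()]
    }
    fn get(&self, _key: &str) -> Vec<u8> {
        // ...one read per existing asset...
        Vec::new()
    }
    fn put_with_ttl(&self, _key: &str, _value: &[u8], _ttl_secs: u64) {
        // ...one write per existing asset...
    }
}

fn main() {
    let kv = KvClient;
    let existing = kv.list_keys();
    // Two requests per existing key, before any new asset is uploaded.
    for key in &existing {
        let value = kv.get(key);
        kv.put_with_ttl(key, &value, 300);
    }
    println!("{} reads + {} writes just to expire old assets", existing.len(), existing.len());
}
```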

We remedied the second issue by setting a time-to-live instead of an expiration timestamp, but by this point the first issue had become so exacerbated that publishes were simply failing to download (or maybe re-upload, we're still not sure exactly which API requests were failing) the files from their KV namespaces.

We could maybe solve our original problem by introducing an artificial delay of ~60s in-between publishing new assets and deleting old ones, but if you publish a new version of your site more often than once per minute, you would still run into the "my KV namespace is blowing up" issue.

We (@petebacondarwin, @threepointone and myself) chatted about this and think we're going to leave this behavior as-is in wrangler 1 for the time being -- unless of course the "fix" (reversion) introduced in 1.19.11 doesn't work -- and focus further work on this issue in wrangler 2. Additionally, since we offer a more complete site hosting solution in Cloudflare Pages (docs), once the dust settles here the original complaint is likely to become lower priority.

For those interested in moving their projects to Pages, we do have a migration guide that could be helpful. Wrangler2 will also have a dedicated pages subcommand, so keep an eye on that, and please leave feedback in the wrangler2 repo if the command doesn't do what you need (or expect) it to do.

Thank you so much to @ryan-moonwell for filing this issue, and to everyone who posted in the thread for providing valuable feedback and insight, which allowed us to iterate on this rather quickly and reach what is (hopefully!) a working solution. I'll leave this open for now until we've either had reports of successful publishes or the issue goes stale.

threepointone added a commit to cloudflare/workers-sdk that referenced this issue Mar 22, 2022
We discovered critical issues with the way we expire unused assets with `[site]` (see #666, cloudflare/wrangler-legacy#2224), so we're going back to the legacy manner of handling unused assets, i.e. deleting unused assets.

Fixes #666
petebacondarwin pushed a commit to cloudflare/workers-sdk that referenced this issue Mar 22, 2022
We discovered critical issues with the way we expire unused assets with `[site]` (see #666, cloudflare/wrangler-legacy#2224), so we're going back to the legacy manner of handling unused assets, i.e. deleting unused assets.

Fixes #666
@koeninger
Contributor

I don't understand that explanation. The expiry approach should have at most tripled the work (download X old files, upload to expire X old files, upload Y new files) versus the approach of just uploading Y new files, and I'd expect X to be proportional to Y. To put it another way, if this approach doesn't work for 1,000 files, I'd expect the old approach to stop working at 3,000 files, which is still a problem. It'd be a problem even if there were an API to set expiration on a key without uploading a value.

If you're dealing with X being much larger than Y because people were using versions of wrangler that didn't delete anything for a long time, and you're not handling rate limiting correctly enough to get a slow but eventually successful first publish on a namespace that's become bloated with a bunch of old files, the easiest way for a customer to deal with that is just to rename the project and delete the old KV namespace.

@threepointone
Contributor

We'll have to revisit the expiration logic; it's not clear what we were doing wrong, and it stands that we'll have to add rate-limiting logic on our side at least, in addition to whatever else we were missing. That said, I'll assume this issue is fixed and folks are unblocked from making Sites deploys. Closing this issue.

@jdddog

jdddog commented Mar 31, 2022

I'm still having problems with wrangler v1.19.11 that might be related to this: it throws an error when deleting stale files, with no error message:

💁  Deleting stale files...
Error: 

Deleting stale files eventually works if I run the wrangler deploy command once again after it has failed.

It was working OK soon after v1.19.11 was released; however, the above error popped up today (or I only noticed it today). I'm not sure if it is related, but we deploy around 20K files (for around 5,000 pages).

@petebacondarwin

@jdddog - deleting 20K assets could be problematic in itself, with or without this fix... The API can only handle a few thousand assets at a time, so Wrangler should be batching them up. It is possible that an individual batch is failing for some reason. This would explain why the deletion finally works after a number of attempts, since some of the batches might be succeeding. 🤔
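
For illustration, batching might look roughly like the sketch below; `bulk_delete` and `delete_stale` are hypothetical, and the 10,000-keys-per-call cap is an assumption about the bulk endpoint's limit rather than wrangler's actual behavior:

```rust
// Hypothetical batching sketch; `bulk_delete` stands in for one call to the
// KV bulk-delete endpoint and is not a real wrangler function.
const MAX_KEYS_PER_CALL: usize = 10_000; // assumed bulk limit per request

fn bulk_delete(batch: &[String]) -> Result<(), String> {
    // ...one bulk-delete request carrying this batch of keys...
    println!("deleting {} keys in one call", batch.len());
    Ok(())
}

fn delete_stale(keys: &[String]) -> Result<(), String> {
    for batch in keys.chunks(MAX_KEYS_PER_CALL) {
        // If any single batch fails, the whole publish fails, which would
        // match the "works after re-running it a few times" symptom.
        bulk_delete(batch)?;
    }
    Ok(())
}

fn main() {
    let keys: Vec<String> = (0..20_000).map(|n| format!("assets/page-{n}.html")).collect();
    delete_stale(&keys).expect("stale-file deletion failed");
}
```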

@jdddog

jdddog commented Apr 4, 2022

@jdddog - deleting 20K assets could be problematic in itself, with or without this fix... The API can only handle a few thousand assets at a time, so Wrangler should be batching them up. It is possible that an individual batch is failing for some reason. This would explain why the deletion finally works after a number of attempts, since some of the batches might be succeeding. 🤔

Thanks @petebacondarwin. I see, that makes sense. It used to work OK without hanging or giving errors; it has only recently become a problem. It is probably more like 10K assets that get updated each time (as some files stay the same and don't get deleted).

@petebacondarwin

We need to add a bit more sophistication in how we do this upload/sync/delete of assets. If you are currently unblocked, then this is likely to be work targeting Wrangler v2.1.
