
How to upload a file in chunks? #1192

Closed
gajus opened this issue May 13, 2020 · 12 comments

Comments

gajus commented May 13, 2020

I am losing my mind here; I think I've tried absolutely everything and I still cannot figure out the proper way to do this.

Suppose I have a 0.9MB file.

$ ll
-rw-r--r--@  1 gajus  staff   921K 13 May 14:38 test.png

I then split that file into 2 chunks:

$ split -b 1000000 test.png
$ ll
-rw-r--r--@  1 gajus  staff   921K 13 May 14:38 test.png
-rw-r--r--   1 gajus  staff   488K 13 May 14:56 xaa
-rw-r--r--   1 gajus  staff   433K 13 May 14:56 xab

Assuming I only have access to the resulting chunks (and they might be on different servers, i.e. the uploads must be two distinct operations), what is the correct way to upload test.png?

Here is what I tried:

const { Storage } = require('@google-cloud/storage');

const googleStorage = new Storage();

(async () => {
  const googleStorageBucket = googleStorage.bucket('contrawork');
  const file = googleStorageBucket.file('images/test.png');

  // Start a resumable upload session and reuse its URI for both chunks.
  const uri = (await file.createResumableUpload())[0];

  await googleStorageBucket.upload('./xaa', {
    gzip: false,
    offset: 0,
    predefinedAcl: 'publicRead',
    resumable: true,
    uri,
    validation: false,
  });

  // Append the second chunk at the byte offset where the first chunk ended.
  await googleStorageBucket.upload('./xab', {
    gzip: false,
    offset: 500000,
    predefinedAcl: 'publicRead',
    resumable: true,
    uri,
    validation: false,
  });
})();

However, this uploads only the first chunk. The second chunk is never appended.

images_test.png.zip

gajus commented May 13, 2020

Then I thought that maybe offset also skips the first bytes of the input (which I could work around by pre-filling a buffer of the needed length with zeros). Therefore, I tried:

const fs = require('fs');
const { Storage } = require('@google-cloud/storage');
const { tmpNameSync } = require('tmp');

const googleStorage = new Storage();

(async () => {
  const googleStorageBucket = googleStorage.bucket('contrawork');
  const file = googleStorageBucket.file('images/test.png');
  const uri = (await file.createResumableUpload())[0];

  await googleStorageBucket.upload('./xaa', {
    gzip: false,
    offset: 0,
    predefinedAcl: 'publicRead',
    resumable: true,
    uri,
    validation: false,
  });

  // Concatenate both chunks so the full file is present beyond the offset.
  const temporaryFileName = tmpNameSync();

  fs.writeFileSync(temporaryFileName, Buffer.concat([
    fs.readFileSync('./xaa'),
    fs.readFileSync('./xab'),
  ]));

  await googleStorageBucket.upload(temporaryFileName, {
    gzip: false,
    offset: 500000,
    predefinedAcl: 'publicRead',
    resumable: true,
    uri,
    validation: false,
  });
})();

But that just uploads exactly the same file.

gajus commented May 13, 2020

There is no error or anything. It seems that PUT requests are simply ignored.

The first request/response:

 PUT https://storage.googleapis.com/upload/storage/v1/b/contrawork/o?name=images%2Ftest.png&uploadType=resumable&upload_id=[..]

Content-Range:      bytes 0-*/*
Authorization:      Bearer [..]
User-Agent:         google-api-nodejs-client/5.10.1
x-goog-api-client:  gl-node/14.1.0 auth/5.10.1
Accept:             application/json
Accept-Encoding:    gzip,deflate
Connection:         close
Host:               storage.googleapis.com
Transfer-Encoding:  chunked

{
  "kind": "storage#object",
  "id": "contrawork/images/test.png/1589408553615624",
  "selfLink": "https://www.googleapis.com/storage/v1/b/contrawork/o/images%2Ftest.png",
  "mediaLink": "https://storage.googleapis.com/download/storage/v1/b/contrawork/o/images%2Ftest.png?generation=1589408553615624&alt=media",
  "name": "images/test.png",
  "bucket": "contrawork",
  "generation": "1589408553615624",
  "metageneration": "1",
  "storageClass": "NEARLINE",
  "size": "500000",
  "md5Hash": "qCGVMqig4yOiO2yoeRnvNg==",
  "crc32c": "KQhJug==",
  "etag": "CIiC96HwsekCEAE=",
  "timeCreated": "2020-05-13T22:22:33.615Z",
  "updated": "2020-05-13T22:22:33.615Z",
  "timeStorageClassUpdated": "2020-05-13T22:22:33.615Z"
}

The second:

 PUT https://storage.googleapis.com/upload/storage/v1/b/contrawork/o?name=images%2Ftest.png&uploadType=resumable&upload_id=[..]

Content-Range:      bytes 500000-*/*
Authorization:      [..]
User-Agent:         google-api-nodejs-client/5.10.1
x-goog-api-client:  gl-node/14.1.0 auth/5.10.1
Accept:             application/json
Accept-Encoding:    gzip,deflate
Connection:         close
Host:               storage.googleapis.com
Transfer-Encoding:  chunked

{
  "kind": "storage#object",
  "id": "contrawork/images/test.png/1589408553615624",
  "selfLink": "https://www.googleapis.com/storage/v1/b/contrawork/o/images%2Ftest.png",
  "mediaLink": "https://storage.googleapis.com/download/storage/v1/b/contrawork/o/images%2Ftest.png?generation=1589408553615624&alt=media",
  "name": "images/test.png",
  "bucket": "contrawork",
  "generation": "1589408553615624",
  "metageneration": "1",
  "storageClass": "NEARLINE",
  "size": "500000",
  "md5Hash": "qCGVMqig4yOiO2yoeRnvNg==",
  "crc32c": "KQhJug==",
  "etag": "CIiC96HwsekCEAE=",
  "timeCreated": "2020-05-13T22:22:33.615Z",
  "updated": "2020-05-13T22:22:33.615Z",
  "timeStorageClassUpdated": "2020-05-13T22:22:33.615Z"
}

JustinBeckwith transferred this issue from googleapis/google-cloud-node May 13, 2020
product-auto-label bot added the api: storage label May 13, 2020
yoshi-automation added the triage me label May 14, 2020
stephenplusplus commented May 14, 2020

Thanks for the detailed write-up. As it stands, we aren't currently able to handle this. There are a few potential solutions I can think of:

  • Save the chunks remotely, then combine them when complete:
    • bucket.combine(['fragment-1', 'fragment-2'], 'complete-file.txt')
    • Downsides: Cost penalty to create multiple files
  • Save the object remotely, then overwrite it to append the next chunk (see the sketch after this comment)
    • Downsides: Cost/performance penalty to download and upload a growing file
  • Modify gcs-resumable-upload to handle this behavior (or a new method/library)
    • Downsides: Unsure this is a proper usage. Resumable uploads appear to have a minimum chunk size of 262144 bytes, so depending on the file, this could be unrealistic.

@frankyn is there a better way to handle this scenario? Also, please correct any mistakes I may have made in my breakdown above.
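
To make the trade-off of that second option concrete, here is a minimal sketch (the bucket name and the appendChunk helper are placeholders, not part of the library API). Every append re-downloads and re-uploads the whole object, so the cost grows with the object size:

const { Storage } = require('@google-cloud/storage');

const storage = new Storage();
const bucket = storage.bucket('contrawork');

// Append one chunk to an existing object by downloading the current bytes and
// re-uploading the concatenation.
const appendChunk = async (objectName, chunkBuffer) => {
  const file = bucket.file(objectName);
  const [exists] = await file.exists();
  const [current] = exists ? await file.download() : [Buffer.alloc(0)];
  await file.save(Buffer.concat([current, chunkBuffer]));
};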

stephenplusplus added the type: question label and removed the triage me label May 14, 2020
frankyn commented May 14, 2020

The main issue with using resumable uploads in this scenario is that they're sequential, meaning you can't set the offset to an arbitrary position from different servers.

Your best bet will be to use @stephenplusplus's first suggestion using combine(). What you would do is the following:

  1. Upload the chunks from each server to a GCS bucket.
  2. After the chunks are uploaded, call bucket.combine() on up to 32 objects at a time. (Order them sequentially so they are concatenated correctly.)
  3. The resulting combined object is then used in subsequent bucket.combine() calls to concatenate the remaining objects.
  4. Delete the now-unnecessary chunks.

The nice part of this design is that your chunks can all be uploaded to GCS from different services without having to deal with ordering in terms of offsets. You would most likely order them by object name. GCS handles concatenating the objects in atomic operations and provides the resulting object.
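
A minimal sketch of that flow, assuming the chunk objects are already uploaded and named so that lexicographic order matches byte order (the bucket name, chunk names, and the composeChunks helper below are placeholders, not part of the library API):

const { Storage } = require('@google-cloud/storage');

const storage = new Storage();
const bucket = storage.bucket('contrawork');

// Fold an ordered list of chunk object names into a single destination object,
// respecting the 32-source limit of each combine()/compose call.
const composeChunks = async (chunkNames, destinationName) => {
  const remaining = chunkNames.slice();
  let composite = null;

  while (remaining.length > 0) {
    // After the first batch, the running composite takes one of the 32 slots.
    const batch = remaining.splice(0, composite ? 31 : 32);
    const sources = composite ? [composite, ...batch] : batch;
    [composite] = await bucket.combine(sources, destinationName);
  }

  // Delete the intermediate chunk objects once the composite exists.
  await Promise.all(chunkNames.map((name) => bucket.file(name).delete()));

  return composite;
};

composeChunks(['images/test.png.part-000', 'images/test.png.part-001'], 'images/test.png')
  .catch(console.error);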

gajus commented May 14, 2020

Doesn't this incur a significant additional cost?

gajus commented May 15, 2020

This would easily be possible if the (arbitrary?) 262,144-byte minimum chunk restriction did not exist. With that restriction, it is impossible to develop this functionality.

The use case is a distributed file upload system.

@stephenplusplus

Because there's nothing we can do from our library, I'm going to close the issue. If anyone would like to chime in with more information or ideas, please feel free.

frankyn commented Jun 16, 2020

A lot has been going on and I got behind on these issues.

@stephenplusplus, is it possible to make the chunk size modifiable? IIUC, the 256 KB multiple is perf guidance but chunks can be smaller.

If @gajus is willing to implement the necessary distributed file system without a compose, it could be helpful in this case.

gajus commented Jun 17, 2020

For a bit of context, this is what I was developing.

https://github.com/gajus/express-tus

The minimum per-chunk upload size limitation made it impossible to use with Google Storage in a distributed system. The only way to make it work is to first upload the file to our local storage and then upload it to Google Storage, which is far from perfect (because the user now needs to wait roughly 2x the upload time).

frankyn reopened this Jun 17, 2020
@stephenplusplus

IIUC, the 256 KB multiple is perf guidance but chunks can be smaller.

Sorry @frankyn, I missed this tag. The 256 KB limit isn't imposed by this library; it comes from the upstream API.

@danielduhh

FYI @gajus, chunkSize is now supported as an optional resumable upload parameter in the latest version of the client. See docs for details on usage.
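
For reference, a minimal sketch of passing chunkSize through a resumable upload stream, assuming a recent client version; the bucket/file names and the 32 MiB value are only illustrative, and chunkSize must be a multiple of 256 KiB (262,144 bytes):

const fs = require('fs');
const { Storage } = require('@google-cloud/storage');

const storage = new Storage();
const file = storage.bucket('contrawork').file('images/test.png');

fs.createReadStream('./test.png')
  .pipe(file.createWriteStream({
    resumable: true,
    chunkSize: 32 * 1024 * 1024, // send the resumable upload in 32 MiB chunks
  }))
  .on('error', console.error)
  .on('finish', () => {
    console.log('upload complete');
  });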

@Chethan-sn

Hi @danielduhh,
Could you please provide an example of how to use it? chunkSize is mentioned in CreateResumableUploadOptions, but even after setting it, only the first chunk appears to be pushed. I am trying to upload an 85 MB file divided into 32 MB chunks (except the last one).
Thanks
