
car sharding #1

Open
thewtex opened this issue Feb 2, 2022 · 4 comments

thewtex commented Feb 2, 2022

@d70-t awesome work!!

What do you think about CAR sharding? That is, for a large dataset, the arrays in the dataset are split into separate CAR files. Large arrays may themselves be split into multiple CAR files, as carbites does.

The motivation is to support uploads to services like web3.storage and estuary.tech for datasets larger than 32 GB.
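
For illustration, a minimal sketch of size-based sharding: blocks are greedily packed into shards until a size budget is reached, then a new shard is started. The `write_car` helper and the `store.iter_blocks()` iterator are assumptions standing in for whatever CAR writer and store API get used, not ipldstore code; the 32 GB default mirrors the upload limit above:

```python
def write_car(path, blocks):
    """Hypothetical stand-in for a real CAR serializer."""
    raise NotImplementedError

def shard_blocks(blocks, max_shard_bytes=32 * 2**30):
    """Greedily pack (cid, data) pairs into shards under a size budget."""
    shard, size = [], 0
    for cid, data in blocks:
        # Start a new shard once the budget would be exceeded.
        if shard and size + len(data) > max_shard_bytes:
            yield shard
            shard, size = [], 0
        shard.append((cid, data))
        size += len(data)
    if shard:
        yield shard

# Usage sketch: write each shard to its own CAR file.
# for i, shard in enumerate(shard_blocks(store.iter_blocks())):
#     write_car(f"dataset.{i}.car", shard)
```

A real implementation would also have to account for CAR header overhead and decide which roots each shard advertises; carbites' strategies differ mainly in how they pick the blocks that go into each shard.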

d70-t commented Feb 2, 2022

Yes, CAR sharding should be there! I'm just wondering a bit whether it should be part of the ipldstore or kept separate.

We could use CAR sharding to pack small chunks into larger objects for an object store (e.g. S3), and we could also pack several CARs into an even larger CAR to send them off to some "high(er) latency storage" like tape or Filecoin etc...

I'd expect this to work particularly well if the CARs contain the groups one would normally use for zarr sharding. Depending on how the zarr keys are built (Morton code?), this may or may not correspond to the carbites treewalk strategy.

CARv2 (== CAR + index) would probably even make a nice structure for zarr shards.
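
As one illustration of that grouping, here is a minimal sketch (the key layout and `chunks_per_shard` are assumptions, not ipldstore code) that maps flat zarr chunk keys to shard groups; each group could then become one CAR, with a CARv2 index acting as the shard's lookup table:

```python
from collections import defaultdict

def shard_key(chunk_key: str, chunks_per_shard: int = 2) -> str:
    """Map a zarr chunk key like 'temperature/3.7' to a shard group.

    Chunks whose grid coordinates fall into the same
    chunks_per_shard-sized block land in the same group, mimicking
    the grouping zarr sharding would use.
    """
    array, _, coords = chunk_key.rpartition("/")
    block = [str(int(c) // chunks_per_shard) for c in coords.split(".")]
    return f"{array}/{'.'.join(block)}"

keys = ["t/0.0", "t/0.1", "t/1.0", "t/1.1", "t/2.0"]
groups = defaultdict(list)
for k in keys:
    groups[shard_key(k)].append(k)
# {'t/0.0': ['t/0.0', 't/0.1', 't/1.0', 't/1.1'], 't/1.0': ['t/2.0']}
```

Grouping by coarse grid coordinates keeps spatially adjacent chunks in the same CAR, which should match typical zarr read patterns.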

thewtex commented Feb 3, 2022

Yes, I don't know either whether it should be part of the ipldstore or something separate.

I am thinking about the use case of building a large zarr dataset on a cluster.

  • Is there one ipldstore or multiple?
  • Can we build the cars on the nodes where the data was generated?
  • How is the entire dataset composed?

CARv2 does look quite nice and helpful!

d70-t commented Feb 3, 2022

Distributed writes should definitely be possible. My guess would be that we want one ipldstore per worker / thread etc... As a result, every worker would only see what it has written itself (or what was preloaded into the ipldstore beforehand). I believe this should be fine for most use cases, but I don't know for sure yet.

The resulting IPLD objects would be dumped somewhere by whatever transport method suits best. An option would be to generate CARs locally and send them off.

The roots (i.e. what freeze() returns) of each ipldstore would have to be collected and sent to some function which merges the individual trees into one. This function would operate only on higher-level blocks and the CIDs of the leaves, so it would need far fewer resources and could probably run on a single node. However, it would also be possible to scatter it across multiple nodes (e.g. one job per variable etc...). The idea of a CRDT may come in handy at this point.

It may also be helpful (e.g. to avoid waiting for big CARs to reach storage) to create additional partial CARs on each worker, containing only those pieces the merging function might need, and to collect those together with the roots. A sketch of this workflow follows below.
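
A minimal sketch of that workflow, under stated assumptions: `freeze()` is the real call mentioned above, while `IPLDStore`, `export_car()`, and the partitioning are placeholder names for this illustration, not the current ipldstore API:

```python
def write_partition(partition):
    """Write one worker's chunks into a fresh, worker-local store.

    Every worker sees only what it wrote itself; the resulting blocks
    are dumped as a local CAR and shipped by whatever transport fits.
    """
    store = IPLDStore()                  # placeholder name
    for key, chunk in partition:
        store[key] = chunk
    root = store.freeze()                # root CID of this worker's subtree
    store.export_car(f"{root}.car")      # placeholder: dump blocks as a CAR
    return root

def merge_roots(roots):
    """Merge the per-worker trees into one dataset tree.

    Touches only higher-level blocks and leaf CIDs, so it is cheap
    enough to run on a single node.
    """
    ...

# Usage sketch: roots collected from all workers, then merged once.
# roots = [write_partition(p) for p in partitions]
# dataset_root = merge_roots(roots)
```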

d70-t commented Feb 3, 2022

Another question is whether some reorganization of the "big" CARs would be needed, in case the set of IPLD objects created by one worker doesn't match the sharding one would like to have on read. But maybe that should be regarded as an implementation detail of whichever system ends up hosting the CARs / IPLD objects.
