Sharding IPLD objects #76

jbenet · 2015-12-02T14:50:05Z

This note gathers the constraints + will drive toward a design of object sharding in IPFS and IPLD. Object sharding is the algorithms and formats used to represent a single (virtual) large object out of many smaller ones. Think of this like the way large directories are represented in modern filesystems (RB-Trees, B-Trees, HTrees, etc).

Sharding IPLD objects in general is a useful thing. instead of implementing it for unixfs and other datastructs each time, we could implement it once. it could be a datastruct the others employ, or maybe -- if it is simple enough -- it belongs as part of IPLD itself.

Constraints to support:

efficient in the small case (1 to 5 nodes)
allows user-chosen sharding (eg for small numbers of nodes, may want specific construction)
large fanouts (millions or billions)
efficient access
minimize insertion re-writes (shadowing/cloning)
upgradeable algorithms (can signal which sharding algo via version, or even with a key/val)
union style fanouts
hierarchical style fanouts (patricia tries)

For large fanouts, look at

all filesystems research into indirect block topologies
ext4 dir entries (list + tree) https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Directory_Entries
HTree https://en.wikipedia.org/wiki/HTree
PHTree http://phunq.net/pipermail/tux3/2013-January/000026.html
B-trees + shadowing + clones http://liw.fi/larch/ohad-btrees-shadowing-clones.pdf
djb's cdb http://cr.yp.to/cdb.html
@tv42's set + multiset on ipfs (used already for 0.4.0+ pinset) pinning index (as ipfs objects) and cdb discussion #4 (comment)

case for supporting it on-top of IPLD

It is nice that the IPLD spec is very simple. Finding a nice way to support this without complicating it much will be hard-- the constraints above do not bode well for this.
can define it as a different datastruct, should not be hard for other datastructs to extend it
flexible algorithms for sharding may complicate IPLD

case for supporting it in IPLD

we could have a very powerful datastructure if sharding came everywhere
merkle-linking in IPLD is already like hierarchical fanout sharding of a single massive tree, this is just sharding within a single level.
IPLD already has flexible algos in multicodec
could use a directive like @shard or something
could be an IPLD extension if not properly in core spec.

cc @whyrusleeping @lgierth @diasdavid @cryptix @ion1 @mildred @tv42 @wking

The text was updated successfully, but these errors were encountered:

ion1 · 2015-12-03T09:13:19Z

It might be useful to take advantage of chunking by a rolling hash with large flat data structures receiving small arbitrary changes, such as a directory.

davidar · 2015-12-17T01:52:25Z

Radix trees have the benefit of being deterministic (invariant to the order of insertions), which helps with deduplication.

Edit: also see http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.5452

jbenet · 2016-01-19T10:54:26Z

cc @whyrusleeping comment here.

jbenet · 2016-01-19T10:58:22Z

From the IPLD spec, one very trivial but extensible thing to do all-IPLD-object-wide is something like this:

{
  "@extends": [
    "/ipfs/<hash1>",
    "/ipfs/<hash2>",
    "/ipfs/<hash3>",
  ],
  ... // more properties
}

which would merge the objects like so:

o1 = get(/ipfs/<hash1>)
o2 = get(/ipfs/<hash2>)
o3 = get(/ipfs/<hash3>)
o4 = get(/ipfs/<hash4>) // our object above

merged = _.extends(o1, o2, o3, o4)

meaning that we start with the first, and we patch in order. so a property in o4 replaces those preceding it. this is very simple and can likely be used to support the desires mentioned earlier in this issue (more complicated structures).

It would be useful to have examples of directory sharding with:

B-trees + shadowing + clones
a Radix tree

expressed as the corresponding JSON or YML syntax for IPLD. (i use JSON above because it tends to be clearer, but i prefer YML). that way we can see what it takes to support it natively in IPLD.

davidar · 2016-01-19T11:25:23Z

Personally I'd like it if this was a core part of IPLD, so that large objects were transparently sharded without the user having to explicitly do anything (much like the transparent merkle-linking). Here's an example of a manual radix tree though:

r:
  om:
    an:
      e: 1
      us: 2
    ulus: 3
  ub:
    e:
      ns: 4
      r: 5
    ic:
      on: 6
      undus: 7

Sub-trees can be split into separate (merkle-linked) objects, analogously to what we already do with chunking linear streams.

jbenet · 2016-01-19T11:31:59Z

also https://en.wikipedia.org/wiki/HAT-trie

mildred · 2016-01-19T22:23:48Z

I think this should not be implemented below the basic IPLD data model: you want, depending on the application, to control where to split objects. For example, metadata should be easily accessible while the complete file data could take a little more time to load.

I think this should be an extension of IPLD. And you could have the choice to:

fetch only the requested object
fetch the logical record composed of multiple objects reassembled into a bigger one

It might be necessary to think how exactly we are going to reassemble the objects. Which algorithms. In particular. Especially considering that's not just about tree merging, but also data string merging.

For example, an object might be composed of a huge binary data string (for example the object representing a huge file). This data string shoud be able to be split among different physical IPLD objects.

The way I would implement it is to have index objects that links to sub parts, and specify with the link the way they are merged. Something like jbenet suggested

@davidar I disagree with you when you say that it should be completely transparent. The application should have control over the splitting of files. We could have a library that does a good job of splitting when the application author doesn't want to bother, but we should be able to have full control.

jbenet · 2016-01-24T13:24:27Z

Agree in full with @mildred

jbenet · 2016-06-22T06:50:40Z

Some links for references.

https://en.wikipedia.org/wiki/Hash_array_mapped_trie
search the persistent data struct lit more.
http://www.staff.city.ac.uk/~ross/papers/FingerTree.html
I started making a "recursive hash table", that's similar to typical associative array optimizations.
at the extreme of that lie thigns like Judy Arrays (tries)
And ofc, I keep returning to finger trees, and patricia tries.
https://infoscience.epfl.ch/record/64398/files/idealhashtrees.pdf

A tricky problem here is tuning the trade off between "good datastructures that create very small nodes" and "good sized objects to minimize the hashing costs, cache misses, and network ops". Basically, there's sharding, which represents one object with many small ones. and then there's aggregation, to represent many small objects with a big one. Most good approaches to sharding from data struct literature yield excellent datastructs with many small objects, which we'd want to aggregate. aggregation is tricky, because making it stable (convergent, for hashing) and efficient (minimize update costs) is challenging. Aggregation is not straight forward; because "bundling many objects in the underlying transport instead of in the data model" is not enough, as that does not minimize hashing and random access advertisements.

jbenet · 2016-06-22T08:20:48Z

Parameters

the max bucket size (fanout) should be configurable, and stored in the HT root node.
likely HAMT params should be configurable

Paging

keeping nodes in memory is a tough choice. will want to amortize loading things. and will also want to "page things out".
we're going to run into this everywhere.
ex: mfs pages out files, but maybe keeps dirs in memory for editing. what happens when the dir hierarchy is too big? will need to page out too.
we should be able to use these algos and restrict their memory usage, and have them be able to respond to memory pressure (on proc, and just programmatically) -- suggests a sub-proc abstraction.
at the very least we should make sure to address in the algo specs how to deal with memory pressure how to:
- page in (load)
- page out (should just be release for gc)

Writes

writing for all persistent datastructures involves:
- bubbling writes (bubbling up to the root)
- creating many intermediate objects
- many of which are never pointed to by any externalized root. (meaning the objects are never accessible, so writing them is strictly a waste of resources.)
algo specs should describe how to do:
- sync writes (that write all new objects to ipfs)
- in memory only writes (avoid writing to ipfs until datastruct is finalized)
- and coalesced writes (write, but defer writing when under heavy editing pressure)

Hashing

use non-cryptographic hash functions

other latencies in the read/write pipeline may obviate this
however, once a datastructure is paged in, reads should be able to be very fast

use seeds in children/intermediate HT nodes deterministic to their position in the tree/branch.

HAMT does not need this because it uses the same hash from the root.
but note the HAMT papers say "when run out of hash bits, get more hash bits".
this just means extending the hash function for the given seed in a deterministic way.
example 1:
- good: hash(name, n) => hashfn(n + name) (start this way from 0).
- bad: hash(name) => hashfn(name), then later, n > 0: hash(n + seed).

example 2 (good):

hash(name, n) => {
  if (n > 0) 
    name = hash(name, n-1)
  return hashfn(name)
}

// hash(name, 0) = hashfn(name)
// hash(name, 1) = hashfn(hashfn(name))
// hash(name, 2) = hashfn(hashfn(hashfn(name)))

@whyrusleeping confirm all this made sense? o/

Differentiating Leaves

algos should specify how to differentiate leaves. (i think HAMT does)

one easy way in IPLD objects though is this https://github.com/jbenet/ipld-ht-experiment/blob/master/rht.js#L137-L150 -- assume nodes that are just IPLD links are always intermediate nodes, and leaves are a nested object (which carry the name/key being inserted). eg

{
  // an intermediate datastructure node.
  children: [
    {"/": "Qmaaa..."}, // just a link. intermediate
    {"/": "Qmbbb..."}, // just a link. intermediate
    {"/": "Qmccc..."}, // just a link. intermediate
    {"n": "key", "v": {"/": Qmvalue}}, // leaf node (value entry)
    {"/": "Qmeee..."}, // intermediate
  ]
}

whyrusleeping · 2016-07-04T03:27:07Z

@jbenet should the names on links pointing to the intermediate nodes just be their index? I think that makes the most sense.

jbenet · 2016-07-04T06:50:37Z

@whyrusleeping not sure what the index buys you, but sure, that works for the protobuf version. but not the IPLD version.

I think actually that we should make ONE version (IPLD naturally), and project it into protobuf, so that the things are interoperable with IPLD from the get go, and not have two different HAMT implementations.

jbenet · 2016-07-08T22:08:25Z

Great article on map data structures for external memory — i.e. ways of storing indexed data that are optimized for storage on disk (or networks :) ).

http://blog.omega-prime.co.uk/?p=197

whyrusleeping · 2016-08-05T09:52:10Z

@jbenet How does a sharded directory work with paths? Should paths be resolved in the context of unixfs? or strictly with merkledag links?

i.e. a sharded node has links:

{
  Links: {
    "AB" : { 'child shard' },
    "FE" : { 'child shard' },
    "12" : { // this is a child shard too, just showing it for the example
      "Links" : {
         "D4ActualDirName" : "dirhash",
      }
    }
}

this would be a sharded directory that contains (among other things) a directory named "ActualDirName". To access this, I'd expect to be able to do /ipfs/<shardhash>/ActualDirName, but /ipfs/<shardhash>/12/D4ActualDirName is the actual path through the raw DAG.

jbenet · 2016-08-06T04:52:04Z

unixfs should allow resolving the virtual dag transparently.

jbenet · 2016-08-06T04:52:14Z

https://github.com/krl/ipfs-crdts

daviddias · 2017-08-15T11:56:46Z

And update on efficient immutable collections that takes on HAMT and CHAMT:

https://michael.steindorfer.name/publications/phd-thesis-efficient-immutable-collections.pdf

HN discussion

Kubuxu · 2017-08-15T12:07:48Z

Getting HAMT into IPLD would be major step in increasing its usability. It can be done in two ways: by introducing it at resolver layer (it would be transparent), or buy writing out clear spec and few reference implementations on application layer.

I was helping @magik6k to work on Wikipedia Search using IPLD, he had to build his own hashmap structure (which was static and needed full rebuild everything, also parameters weren't autotuned).
This increased time needed for this project significantly.

Kubuxu · 2017-08-15T12:09:44Z

Also AFAIK we already use partially use principles of CHAMT as we don't allocate full array, and we use bitmap masking and hashing.

daviddias · 2017-08-15T12:13:12Z

@Kubuxu agreed, I think the direction that everyone wants to is have two layers of IPLD, the first being the one we implemented (The resolver) and the second one that understands things like sharding and other data structs for custom path handling. I know that a few people put a lot of thought into this and we ended up deciding to implement the base layer first, now it seems the right time to start figuring out that second layer.

jbenet mentioned this issue Dec 2, 2015

Apply chunking to large link list objects ipfs/ipfs#96

Closed

jbenet mentioned this issue Dec 16, 2015

Sharding - unixfs support for large directories ipfs/specs#32

Open

davidar mentioned this issue Jan 19, 2016

Wikipedia ipfs-inactive/archives#20

Open

daviddias mentioned this issue Jun 1, 2016

Support for UnixFS Sharding + support balanced & trickle dag ipfs-inactive/js-ipfs-unixfs-engine#45

Closed

Kubuxu mentioned this issue Jul 3, 2016

Increase 512 kbyte limit for ipfs object put ipfs/kubo#2938

Closed

kevina mentioned this issue Aug 5, 2016

Implement unixfs sharding ipfs/kubo#3042

Merged

JustinDrake mentioned this issue Nov 29, 2016

Fresh perspective from learning IPFS ipfs/specs#145

Open

daviddias mentioned this issue Mar 24, 2017

🚀 0.23 Release 🌟 ipfs/js-ipfs#795

Closed

22 tasks

daviddias added the IPLD label Aug 28, 2017

cobordism mentioned this issue Dec 26, 2017

SIP - Changes in manifest handling (the trailing slash problem and other issues) ethersphere/swarm#178

Open

jvmncs mentioned this issue Apr 30, 2018

Extend Grid to take advantage of IPFS file sharding OpenMined/PyGrid-deprecated---see-PySyft-#181

Closed

daviddias mentioned this issue Jun 25, 2018

Support for bigger dag nodes out of the box ipld/js-ipld-dag-cbor#52

Closed

Stebalien mentioned this issue Dec 11, 2018

Links v2 ipld/specs#83

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sharding IPLD objects #76

Sharding IPLD objects #76

jbenet commented Dec 2, 2015

ion1 commented Dec 3, 2015

davidar commented Dec 17, 2015

jbenet commented Jan 19, 2016

jbenet commented Jan 19, 2016

davidar commented Jan 19, 2016

jbenet commented Jan 19, 2016

mildred commented Jan 19, 2016

jbenet commented Jan 24, 2016

jbenet commented Jun 22, 2016 •

edited

Loading

jbenet commented Jun 22, 2016

whyrusleeping commented Jun 22, 2016

whyrusleeping commented Jun 22, 2016

jbenet commented Jun 29, 2016

jbenet commented Jun 29, 2016 •

edited by daviddias

Loading

whyrusleeping commented Jul 4, 2016

jbenet commented Jul 4, 2016

jbenet commented Jul 8, 2016

whyrusleeping commented Aug 5, 2016

jbenet commented Aug 6, 2016

jbenet commented Aug 6, 2016

daviddias commented Aug 15, 2017

Kubuxu commented Aug 15, 2017 •

edited

Loading

Kubuxu commented Aug 15, 2017

daviddias commented Aug 15, 2017

Sharding IPLD objects #76

Sharding IPLD objects #76

Comments

jbenet commented Dec 2, 2015

case for supporting it on-top of IPLD

case for supporting it in IPLD

ion1 commented Dec 3, 2015

davidar commented Dec 17, 2015

jbenet commented Jan 19, 2016

jbenet commented Jan 19, 2016

davidar commented Jan 19, 2016

jbenet commented Jan 19, 2016

mildred commented Jan 19, 2016

jbenet commented Jan 24, 2016

jbenet commented Jun 22, 2016 • edited Loading

jbenet commented Jun 22, 2016

whyrusleeping commented Jun 22, 2016

whyrusleeping commented Jun 22, 2016

jbenet commented Jun 29, 2016

jbenet commented Jun 29, 2016 • edited by daviddias Loading

Parameters

Paging

Writes

Hashing

Differentiating Leaves

whyrusleeping commented Jul 4, 2016

jbenet commented Jul 4, 2016

jbenet commented Jul 8, 2016

whyrusleeping commented Aug 5, 2016

jbenet commented Aug 6, 2016

jbenet commented Aug 6, 2016

daviddias commented Aug 15, 2017

Kubuxu commented Aug 15, 2017 • edited Loading

Kubuxu commented Aug 15, 2017

daviddias commented Aug 15, 2017

jbenet commented Jun 22, 2016 •

edited

Loading

jbenet commented Jun 29, 2016 •

edited by daviddias

Loading

Kubuxu commented Aug 15, 2017 •

edited

Loading