# [RFC 0122] IPFS CID optionally on narinfo in binary caches (#122)

`rfcs/0122-binary-cache-ipfs.md` (new file, 84 additions):
---
feature: binary-cache-ipfs
start-date: 2022-03-07
author: lucasew
co-authors: (find a buddy later to help out with the RFC)
shepherd-team: John Ericson, Tom Bereknyei, Kevin Amado
shepherd-leader: Tom Bereknyei
related-issues: (will contain links to implementation PRs)
---

# Summary
[summary]: #summary

Add an extra property to the narinfo files in binary caches that references the IPFS CID of the NAR file.

# Motivation
[motivation]: #motivation

IPFS is still not a present reality in the mainstream Nix ecosystem. Although it is not reliable for long-term data storage, it can reduce bandwidth costs for both servers and clients; the open question is where a NAR file can be obtained on the IPFS network.

It's not expected that, for example, cache.nixos.org would run an IPFS daemon for seeding, but it could simply calculate the hash using `ipfs add -nq $file` and provide it in the narinfo, so other nodes can figure out alternative places to download the NAR files, possibly even closer than a CDN could be.
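
As a sketch, the publisher-side step could look like this (the file names are taken from the example later in this RFC; the `IpfsCid` key is the one proposed here):

```
# Compute the CID without adding the file to the local IPFS repo
# (-n/--only-hash skips storing it, -q prints only the CID).
cid=$(ipfs add -nq nar/0i6ardx43rdg24ab1nc3mq7f5ykyiamymh1v37gxdv5xh5cm0cmb.nar.xz)

# Advertise it in the corresponding narinfo.
echo "IpfsCid: $cid" >> gdh8165b7rg4y53v64chjys7mbbw89f9.narinfo
```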
> **Comment (Contributor):** One little concern is that a given file doesn't have exactly one CID: depending on how you chunk the file you can get effectively unlimited different CIDs. This isn't a problem when the CID distributor starts the seed and the CID stays live on the network, because whatever CID is advertised will be fetched. However, for a case like this it matters a lot, because different settings will result in a would-be seeder generating the wrong CID.
>
> IIUC the current default for `ipfs add` is fixed-size blocks of 262144 bytes each (aka `size-262144`). However, for a nixpkgs cache where subsequent versions of a derivation may be largely similar, it may make more sense to use a smarter chunker based on a rolling hash.
>
> Anyway, the exact chunking mechanism is bikeshedding, but what do we want to do about this? I see a few main options (the sketch after this comment shows how two chunkers diverge):
>
> 1. Put the chunker into the narinfo so it can be reproduced. (I don't know if there is a well-defined standard format, but current go-ipfs uses strings like `size-262144` and `rabin-2048-65536-131072`, which are pretty easy to understand and unlikely to be ambiguous.)
> 2. Declare a chunker upfront and expect people to use it. (We can revert to option 1 in the future by adding the chunker information later.)
> 3. Convince cache.nixos.org to also run an IPFS node that advertises the CIDs that are advertised in the narinfo files.
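
To make the divergence concrete, a minimal sketch (the NAR file name is illustrative; `--chunker` is an existing `ipfs add` option):

```
# Hashing the same file with two different chunkers yields two
# different CIDs; a would-be seeder must use the publisher's settings.
ipfs add -nq --chunker=size-262144 hello-2.10.nar.xz
ipfs add -nq --chunker=rabin-2048-65536-131072 hello-2.10.nar.xz
```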

> **Comment (Member):** rsync has a pretty interesting algorithm for syncing files (https://stackoverflow.com/questions/1535017/rolling-checksums-in-the-rsync-algorithm); there may be something in that, although it's probably not directly portable to IPFS and its chunking.
>
> I'd vote for 3! Get that working today (or perhaps tomorrow), and think about options 1/2 the day after tomorrow (or at some point in the future).
>
> Thanks for your detailed analysis of this; my understanding of NARs on IPFS has increased!

> **Comment (@kevincox, Contributor, Mar 12, 2022):** This is basically equivalent to Rabin chunking. But the biggest problem isn't what algorithm to use but how to know what algorithm was used.

> **Comment (Author):** For this we could do what we already do with hashes, e.g. `sha256:something`. AFAIK IPFS has symbol-friendly names for the chunking methods.
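
A sketch of that convention applied to the proposed field (the `size-262144:` prefix is hypothetical syntax, not an existing format; the CID value is reused from this RFC's example):

```
IpfsCid: size-262144:Qmf8NfV2hnq44RoQw9vxmSpGYTwAovA8FUCxeCJCqmXeNN
```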

> **Comment (Contributor):** I really don't care about the chunking algorithm. Please stop discussing this here.
>
> What I care about is that we record the chunking algorithm in a way that someone who wishes to advertise this path can do so.


Parallel binary caches could arise in regions where internet connectivity is a problem and local distribution is preferred. If the payload is properly signed, it shouldn't be a problem to prove that a given path originally comes from a given binary cache.

# Detailed design
[design]: #detailed-design

A narinfo file is provided by the binary cache server and contains metadata for an existing path in the binary cache: the Nix store path, which compression algorithm is used, hashes, sizes, references, a signature, and a relative path to download the compressed NAR file.

The narinfo already contains the sha256 hash of the file, but from that hash alone it's still not possible to find out where to download the file on the IPFS network; to make that possible, the CID is required.

This extra field is optional: if the cache provider doesn't supply the IPFS CID, everything keeps working, but that provider cannot leverage IPFS to reduce bandwidth costs.
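
As a sketch of how a client could consume the field (assuming a local IPFS daemon; the file names are taken from the example in the next section):

```
# Resolve a narinfo, then prefer IPFS when a CID is present.
narinfo=$(curl -s https://cache.nixos.org/gdh8165b7rg4y53v64chjys7mbbw89f9.narinfo)
cid=$(printf '%s\n' "$narinfo" | sed -n 's/^IpfsCid: //p')

if [ -n "$cid" ]; then
  # Fetch the compressed NAR from the IPFS network.
  ipfs get "$cid" -o hello-2.10.nar.xz
else
  # Fall back to the plain HTTP URL from the narinfo.
  url=$(printf '%s\n' "$narinfo" | sed -n 's/^URL: //p')
  curl -sO "https://cache.nixos.org/$url"
fi
```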

# Examples and Interactions
[examples-and-interactions]: #examples-and-interactions

Today, a narinfo looks like this:

```
StorePath: /nix/store/gdh8165b7rg4y53v64chjys7mbbw89f9-hello-2.10
URL: nar/0i6ardx43rdg24ab1nc3mq7f5ykyiamymh1v37gxdv5xh5cm0cmb.nar.xz
Compression: xz
FileHash: sha256:0i6ardx43rdg24ab1nc3mq7f5ykyiamymh1v37gxdv5xh5cm0cmb
FileSize: 40360
NarHash: sha256:1ddv0iqq47j0awyw7a8dmm8bz71c6ifrliq53kmmsfzjxf3rwvb8
NarSize: 197528
References: 7gx4kiv5m0i7d7qkixq2cwzbr10lvxwc-glibc-2.27 gdh8165b7rg4y53v64chjys7mbbw89f9-hello-2.10
Deriver: 5sj6fdfym58sdaf3r5p87v4l8sj2zlvn-hello-2.10.drv
Sig: cache.nixos.org-1:K0thQEG60rzAK8ZS9f1whb7eRlIshlMDJAm7xvX1oF284H+PTqlicv/wGW6BIj+wWWONHvUZ2MYc+KDArekjDA==
```

This RFC proposes a new key-value pair that in this example would be:

```
IpfsCid: Qmf8NfV2hnq44RoQw9vxmSpGYTwAovA8FUCxeCJCqmXeNN
```
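
A would-be seeder that already holds the NAR could then check that it reproduces the advertised CID before announcing it (a sketch; assumes the publisher used the default chunker):

```
# Re-add the NAR locally (for real this time, so it gets seeded)
# and compare the resulting CID against the one in the narinfo.
cid=$(ipfs add -q nar/0i6ardx43rdg24ab1nc3mq7f5ykyiamymh1v37gxdv5xh5cm0cmb.nar.xz)
if [ "$cid" = "Qmf8NfV2hnq44RoQw9vxmSpGYTwAovA8FUCxeCJCqmXeNN" ]; then
  echo "CID matches the narinfo; this node now provides the NAR"
else
  echo "CID mismatch; the publisher likely used a different chunker" >&2
fi
```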

# Drawbacks
[drawbacks]: #drawbacks

It's an extra, optional step for each cache entry.

# Alternatives
[alternatives]: #alternatives

An alternative is to use BitTorrent. BitTorrent doesn't do file-level deduplication, so swarms can easily become divided, but it is far more battle-proven and has many clients that play well with each other. Since NARs are single files, the lack of deduplication shouldn't be a problem in this case.

# Unresolved questions
[unresolved]: #unresolved-questions

Who will seed?

IPFS and Nix stores are different things: IPFS would hold the chunked, compressed NAR file, while Nix would hold the NAR's contents extracted into its store. This could roughly double storage usage.
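
On the seeder side, one possible mitigation is go-ipfs's experimental filestore, which references files in place instead of copying their blocks into the IPFS repo (a sketch; note this only deduplicates the compressed NAR kept on disk, not the extracted store paths):

```
# Enable the experimental filestore, then add the NAR without copying it.
ipfs config --json Experimental.FilestoreEnabled true
ipfs add --nocopy nar/0i6ardx43rdg24ab1nc3mq7f5ykyiamymh1v37gxdv5xh5cm0cmb.nar.xz
```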

This RFC is only about easing binary cache propagation from a previously trusted entity (by default the NixOS official cache keys).

Is the signing system used in Nix for cache entries robust enough?

# Future work
[future]: #future-work

Nix store integration with IPFS to avoid storing the same thing twice and improve seeder availability

Trustix: finding consensus about the right closure of a derivation.