Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: make Cargo embed dependency versions in the compiled binary #2801

Closed

Conversation

Shnatsel
Copy link
Member

@Shnatsel Shnatsel commented Nov 3, 2019

Rendered

This is the first step towards a proper security update story for Rust. You might also want to check out the comments on a prototype implementation that informed this RFC.


2023 update: this has been implemented out-of-tree as cargo auditable, which has garnered significant adoption. This RFC has been updated and expanded based on the lessons learned from it.

@igor-petruk
Copy link

One idea is to wrap Cargo.lock into another trivial format: yaml, json, custom text with two keys: ContentType and Body and you can version your format if Cargo.lock becomes cumbersome (for some yet unknown reasons) and will have to be replaced with something else.

@Alex-Addy
Copy link

I would request that this is not enabled by default. As noted by RustDragon on reddit, this could leak potentially sensitive information through the embedded paths.

@Shnatsel
Copy link
Member Author

Shnatsel commented Nov 3, 2019

As described in the RFC, not having this enabled by default pretty much defeats the point.

Would disclosing that the dependency originated from a local copy, but not disclosing the directory path be an acceptable solution? This would protect information such as usernames and paths.

@Alex-Addy
Copy link

I'm not sure that not having it enabled by default would defeat the point. Having this information available to be enabled would still be immensely valuable to production users. My main concern is users accidentally having their information leaked just by handing out a binary.

@Alex-Addy
Copy link

Would disclosing that the dependency originated from a local copy, but not disclosing the directory path be an acceptable solution? This would protect information such as usernames and paths.

Having thought about this a bit more I think that this would be fine. It might be desirable to be able to turn path embedding back on (for full auditing), with it being disabled by default (for safety).

@Lokathor
Copy link
Contributor

Lokathor commented Nov 3, 2019

Yeah this should definitely be off by default, even if the info is scrubbed of paths and stuff.

@Shnatsel
Copy link
Member Author

Shnatsel commented Nov 3, 2019

@Lokathor could you provide a rationale? I think the RFC makes a pretty strong case for the on-by-default approach. Prior art from Go and (as I've just learned) Ruby supports this as well.

@Shnatsel
Copy link
Member Author

Shnatsel commented Nov 3, 2019

It has also been pointed out that this may interact with reproducible builds in interesting ways - I'm not sure if Cargo.lock is guaranteed to have a stable sorting order.

@Diggsey
Copy link
Contributor

Diggsey commented Nov 3, 2019

I don't necessarily think this should be off by default if the size increase is indeed as minimal as claimed: even in release mode some debug information is included already. I do think it should be removed when the binary is stripped though.

The RFC claims a sub 1% increase in size: I would like to see some more evidence to back that up; the "hello world" example is not enough. The number of dependencies increases the lock file size linearly, but it does not increase the size of the binary linearly due to unused code being removed at link time, so the relative size could go up. Furthermore, platform-specific dependencies are tracked in the lock file but are not compiled into the binary at all on other platforms.

Regarding reproducible builds: as long as the information in the Cargo.lock is encoded in a deterministic way, it shouldn't be a problem. If cargo picks a dependency version non-deterministically then your deterministic build is broken regardless of whether you incorporate the lock file directly.

One thing not covered by this RFC is what crate types will embed this information. Should the "cdylib" type embed the information? On Windows, DLLs are just executables with a different extension so you could reasonably argue that the same version information should be included.

@Shnatsel
Copy link
Member Author

Shnatsel commented Nov 3, 2019

The RFC claims a sub 1% increase in size: I would like to see some more evidence to back that up; the "hello world" example is not enough.

Good point, will do.

Regarding reproducible builds: as long as the information in the Cargo.lock is encoded in a deterministic way, it shouldn't be a problem.

I don't think Cargo.lock has any guarantees regarding stable sorting order for dependencies, for example.

One thing not covered by this RFC is what crate types will embed this information.

It's under "unresolved questions" since I don't have a strong stance on this, but I am leaning towards "yes" for the same reasons you've described.

@Lokathor
Copy link
Contributor

Lokathor commented Nov 3, 2019

Bonus data point: A minimal hello world on Windows is ~3.5k, and the Cargo.lock for that is ~1.9k.

The ratio generally gets better as your program grows. The amount of code that usage of a dependency puts into your binary is usually a lot more than the line or two it takes to list the dependency.

Dependencies that hurt the ratio (adding a dependency entries but not adding code into the final binary) would likely be things like proc-macro crates, macro_rules!-only crates (eg: cfg-if), and bindings crates (eg: winapi).

That said, I still don't want this in my programs because I just don't want extra stuff going in my files.

@Shnatsel
Copy link
Member Author

Shnatsel commented Nov 3, 2019

Size comparisons on binaries of varying sizes, as measured on Linux with ThinLTO:

exa is 2.6Mb stripped, 1.5Mb unstripped, 25Kb Cargo.lock. <1% vs unstripped, <2% vs stripped
cargo-audit is 6.5Mb unstripped, 3.6Mb stripped, Cargo.lock is 41Kb. <1% vs unstripped, <2% vs stripped
cargo-geiger is 17Mb unstripped, 12Mb stripped, Cargo.lock is 95Kb. <1% even vs stripped
ripgrep is 32Mb unstripped, 4.5Mb stripped. Its Cargo.lock is 24Kb, <1% even vs stripped

We can do a Crater run to be sure.

@gmorenz
Copy link

gmorenz commented Nov 3, 2019

reproducible builds

It seems like we should aim to be reproducible even if the crates came from different sources. I.e. downloading something from a registry vs using a local path shouldn't change things.

Maybe we could use a format this is just a sorted sequence of (crate-name, version, hash) instead of directly embedding Cargo.toml? This would also resolve almost all of the information leak concerns.

@Shnatsel
Copy link
Member Author

Shnatsel commented Nov 3, 2019

I fear omitting the registry where a certain dependency came from would mean losing too much information. This defeats use cases such as only allowing dependencies from a private registry, for example.

@tarcieri
Copy link

tarcieri commented Nov 3, 2019

reproducible builds

You can't reproduce a build without Cargo.lock, so if anything, including it in the binary is a helpful forensic artifact for reproducing its build.

@Ixrec
Copy link
Contributor

Ixrec commented Nov 3, 2019

Since build system and package management system are usually decoupled, most other languages did not have the opportunity to implement anything like this.

For prior art and motivation completeness: Where I work at Bloomberg, our C++ codebase has enough control over such things that we rolled our own version embedding tricks many years ago. On top of the security benefits, this has been utterly invaluable for diagnosing many bugs in production, and ensuring that fixes have been fully released. The specific format it uses doesn't make much sense for an open source ecosystem in a language with a proper module/package system, but I thought it was worth mentioning this sort of thing has precedent even in C++.

@rotty
Copy link

rotty commented Nov 3, 2019

As suggested by @Shnatsel, I'll drop a mention of my cargo-audit-tags Python script. It allows a similar check based on the source code, checking the historical Cargo.lock files in git, based on git tags. This allows checking released artifacts based on those tags for known vulnerabilities in dependencies. At work, we have this running as part of a daily CI job, which will send an email if vulnerabilities are detected.

Of course, this does not cover the use case of the RFC, but it is a related tool that has some overlap with the goals of the RFC.

@joshtriplett
Copy link
Member

First of all: thank you for working on the problems of dependency tracking and security analysis! Please don't take any of the feedback below as suggesting that I disagree with the goals.

I agree with the various other posts giving a rationale of not wanting to disclose excessive information, especially about local paths and registries. You need to scrub all of the potentially sensitive information, such as local registries, paths, and repository URLs.

A quick check on my system turns up Cargo.lock files upwards of 100k-140k.

I'm also concerned about exposing the potentially evolving Cargo.lock format (as mentioned in your unresolved questions section); that format has changed substantially over time.

I think we could reduce that format substantially to serve this particular purpose. But at the very least, it seems worth compressing it before embedding it. A quick check suggests that basic zlib compression reduces the size of Cargo.lock files by 75-85%.

You also need to consider the size of the delta from one version of a binary to another, as many distribution mechanisms want to distribute deltas. I can easily imagine the Cargo.lock delta taking up more space than the delta to the code itself. (Note that deltas interact with the issue of compression above; tools to compute deltas need to know about use of compression.)

You should discuss the issue of format versioning. You could theoretically evolve this format by using a new section name if the version changes.

Given that this uses a completely Rust-specific (and cargo-specific) format, the section name should not use something generic like .dep-list.

A "Hello World" on x86 Linux compiles into a ~1Mb file in the best case (recent Rust without jemalloc, LTO enabled).

That's a problem that people are trying to fix. And that's with std. It's possible to build binaries less than 1kB today.

WASM, asm.js and embedded platforms are excempt from this mechanism since they have very strict code size requirements.

(Typo: s/excempt/exempt/)

People care about code size on other platforms, too, and the same target may get used for large applications and embedded development. You can't reliably classify every target into "embedded" and "not embedded".

Also, this isn't just about code size; some use cases need to carefully construct binaries and care about any "unusual" sections that get generated in the binary. Some binaries will get built with linker scripts.

You don't specify any mechanism to enable/disable this. You need to provide and specify appropriate configuration for that.

For a first pass, I would propose addressing the issues above, and then proposing a version that's initially disabled by default. This will let people experiment with it, optimize it, audit the information it includes, and so on. We can always revisit questions of defaults and their interaction with various use cases in the future.

@vincentdephily
Copy link

Regarding the "leaking sensitive information" issue, beyond the obvious "strip local paths and private urls" mitigation, it's useful to think in terms of threat model. How does embedding version info affect the balance between attacker and defender ?

  • The version info is only beneficial if the system is dilligently updated.
  • A a system that can't be or isn't regularly updated could instead argue for security by obscurity.
  • An attacker typically has more practical ways to test for a vuln than reading the binary.
  • When dealing with zero-day vulns, the version info only benefits the attacker.

For most cases I think the version info is a net gain for the defender (and therefore it should eventually be on by default once it matures), but there'll always be cases when it isn't.

It's not just about security: as a user I may want to recompile my binaries when a performance fix is out for example.

@tarcieri
Copy link

tarcieri commented Nov 4, 2019

When I worked on an infosec team at a large payments company, we built a system based on Ruby's Gemfile.lock, which I think is quite similar to the goals of this proposal (and also as mentioned in the proposal).

It scanned filesystems for these files, audited them against a similar vulnerability database (RubySec), opened tickets for vulnerable apps, and auto-closed them when they were updated to non-vulnerable versions.

It seems there's a lot of concerns about potential leakage of internal repositories. However for the above feature to work, we needed that information (which we also used to file internal vulnerabilities against internal packages). Far from that information being a concern, it was something we actually leveraged in an enterprise context.

For enterprises deploying in-house applications, I don't think leaking this information is any concern whatsoever.

For crates published on crates.io, it also isn't a concern as these crates, by definition, can't refer to internal repositories.

So the real information leakage concern is: publicly published binaries (e.g. through a vendor's web site or an app store) leaking this information accidentally because the publishers did not opt out of removing it. To me this is a somewhat valid concern: it's information exposure and a potential loss in defense-in-depth.

But I do just want to make clear that I think, with that acknowledged, this is also a feature that enterprise users will want for internal binaries, to allow internal auditing of deployed application binaries.

@epage
Copy link
Contributor

epage commented Oct 4, 2023

We talked about this in a recent cargo team meeting and there seemed to be interest in first focusing on identifying the needs and seeing what part cargo should be playing in solving the larger SBOM situation. If there is a built-in solution, I assume the extent of the information will mean it would be large and not reproducible, so we'd like want to start with the separate artifact. From there, we can further explore the aims of embedding a subset of the SBOM in the binary.

@Shnatsel would you be interested in pivoting this RFC in that direction and leading or taking part in that effort?

@Shnatsel
Copy link
Member Author

Shnatsel commented Oct 4, 2023

I just got contracted to advance cargo-cyclonedx. So I am getting quite involved in the broader issue of SBOM generation, and I've done a bit of research on what is required for it to work well.

There are some gaps in the information that cargo metadata exposes, most notably the dependency hashes not supporting the feature resolver v2; but I do not see any fundamental reason for SBOM generation not to be built on top of stable APIs of Cargo, such as cargo metadata, once those issues are addressed. The fact that we are dealing with generating a separate SBOM file as opposed to injecting something into the build process also makes things easier. So I don't think it makes sense to build generic SBOM handling into Cargo when external tools can do it equally well (and potentially move faster).

I am planning to migrate cargo cyclonedx to cargo metadata as the backend as part of this contract work, so in a month or two we are going to see if cargo metadata is actually suitable for SBOM use cases, or is there something I'm overlooking.

@jcgruenhage

This comment was marked as resolved.

@epage
Copy link
Contributor

epage commented Oct 4, 2023

I am planning to migrate cargo cyclonedx to cargo metadata as the backend as part of this contract work, so in a month or two we are going to see if cargo metadata is actually suitable for SBOM use cases, or is there something I'm overlooking.

For some of the people I talked to about SBOMs, cargo metadata sounded insufficient:

For both cases, we'd need all build inputs (RUSTFLAGS, enabled features, etc) as well as decisions made by -sys build scripts, etc

@epage

This comment was marked as resolved.

@jcgruenhage

This comment was marked as resolved.

@Shnatsel
Copy link
Member Author

Shnatsel commented Oct 4, 2023

For both cases, we'd need all build inputs (RUSTFLAGS, enabled features, etc) as well as decisions made by -sys build scripts, etc

All of this can be captured with the machinery already used by cargo auditable. It relies solely on the stable interfaces to Cargo and rustc. Enabled features are already collected, and RUSTFLAGS aren't but trivially could be, and "decisions made by -sys build scripts" are a whole separate problem.

Generic SBOM generation is a much bigger scope with a much bigger design space, so I am a bit puzzled how that would make a reasonable starting point. The scope of this RFC is intentionally very narrow, and the format is also designed so that we can easily extend it later if other compelling use cases emerge. This is done specifically to make the change as straightforward and uncontroversial as possible.

@tarcieri
Copy link

tarcieri commented Oct 5, 2023

@epage reading previous comments, it sounds like your concern regarding missing SBOM data is about false positive elimination, however this is a known and longstanding issue with e.g. cargo audit in general rustsec/rustsec#21

It's something we've long been meaning to address, but also if we held off on shipping anything until it was solved we wouldn't have anything to show for it.

Even if it were implemented immediately, it can't yet be consumed by cargo audit, so is there really a good reason to make it a blocker for the implementation of this feature?

At any rate, it's no worse than the current state of affairs with Cargo.lock auditing using cargo audit.

@epage
Copy link
Contributor

epage commented Oct 5, 2023

Maybe put a different way, this solves one problem (identifying vulnerabilities in released binaries that lack a chain of custody) but, for the limited bandwidth of the cargo team, we are interested in putting our efforts in solving a different problem (SBOMs). They are related (cargo-audit has a subset of the needed dependency information). Independent of our general focus, there is the question of designing the manifest configuration for this and I suspect looking at SBOMs first has us looking at the bigger picture and will help us have more of the information we need for the manifest configuration.

@tarcieri
Copy link

tarcieri commented Oct 5, 2023

But if this feature used an SBOM format like CycloneDX, it could be an incremental first step towards full SBOM support that itself is independently useful for vulnerability auditing.

Someone else just needs to, as an additive second step, fill in the missing gaps you've highlighted.

@Shnatsel
Copy link
Member Author

Shnatsel commented Oct 5, 2023

This RFC is deliberately scoped so narrowly that we wouldn't need to design manifest configuration beyond an on/off switch.

I still believe that SBOM generation is better solved outside Cargo. The SBOM formats are still evolving, as is the ecosystem around them. New use cases, requirements and regulations pop up all the time. This rapidly evolving landscape would conflict with Cargo's stability guarantees, and the limited reviewer bandwidth for a rapidly changing implementation would be a hindrance.

I expect to produce proof of this belief in the next couple of months in the form of cargo cyclonedx.

@Shnatsel
Copy link
Member Author

Shnatsel commented Oct 5, 2023

It is clear that the RFC needs major revisions. At least:

  • The details of how this is configured should be specified
  • The Cargo team has expressed a desire in having this off by default, at least initially, which should be reflected in the text
  • I want to experiment with using an industry standard format, CycloneDX, and see if compression will compensate for its verbosity
  • Interactions with other scanning approaches and ways to reduce false positives, such as parsing debug info, should be documented

So I am converting this PR to draft to clearly indicate its status.

@Shnatsel Shnatsel marked this pull request as draft October 5, 2023 14:11
@epage epage mentioned this pull request Jan 12, 2024
@epage
Copy link
Contributor

epage commented May 21, 2024

@rfcbot postpone

I propose we finish up and stabilize #3553 before evaluating next steps for tracking of artifacts.

@rfcbot
Copy link
Collaborator

rfcbot commented May 21, 2024

Team member @epage has proposed to postpone this. The next step is review by the rest of the tagged team members:

No concerns currently listed.

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

@rfcbot rfcbot added proposed-final-comment-period Currently awaiting signoff of all team members in order to enter the final comment period. disposition-postpone This RFC is in PFCP or FCP with a disposition to postpone it. labels May 21, 2024
@rfcbot rfcbot added final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. and removed proposed-final-comment-period Currently awaiting signoff of all team members in order to enter the final comment period. labels Jun 11, 2024
@rfcbot
Copy link
Collaborator

rfcbot commented Jun 11, 2024

🔔 This is now entering its final comment period, as per the review above. 🔔

@rfcbot rfcbot added finished-final-comment-period The final comment period is finished for this RFC. and removed final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. labels Jun 21, 2024
@rfcbot
Copy link
Collaborator

rfcbot commented Jun 21, 2024

The final comment period, with a disposition to postpone, as per the review above, is now complete.

As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.

This is now postponed.

@rfcbot rfcbot added to-announce postponed RFCs that have been postponed and may be revisited at a later time. and removed disposition-postpone This RFC is in PFCP or FCP with a disposition to postpone it. labels Jun 21, 2024
@rfcbot rfcbot closed this Jun 21, 2024
@Shnatsel
Copy link
Member Author

Note to future self: configuration not being sufficiently defined was brought up as a blocker for this RFC.

rust-lang/cargo#14222 has documented the configuration considerations for Cargo. This document should be used as a starting point later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-dependencies Proposals relating to dependencies. A-security Security related proposals & ideas finished-final-comment-period The final comment period is finished for this RFC. postponed RFCs that have been postponed and may be revisited at a later time. T-cargo Relevant to the Cargo team, which will review and decide on the RFC. to-announce
Projects
Archived in project
Status: Unreviewed
Development

Successfully merging this pull request may close these issues.