Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify str::from_utf8_unchecked's invariants #95895

Merged
merged 1 commit into from
Apr 12, 2022
Merged

Conversation

CAD97
Copy link
Contributor

@CAD97 CAD97 commented Apr 10, 2022

Specifically, make it clear that it is immediately UB to pass ill-formed UTF-8 into the function. The previous wording left space to interpret that the UB only occurred when calling another function, which "assumes that &strs are valid UTF-8."

This does not change whether str being UTF-8 is a safety or a validity invariant. (As per previous discussion, it is a safety invariant, not a validity invariant.) It just makes it clear that valid UTF-8 is a precondition of str::from_utf8_unchecked, and that emitting an Abstract Machine fault (e.g. UB or a sanitizer error) on invalid UTF-8 is a valid thing to do.

If user code wants to create an unsafe &str pointing to ill-formed UTF-8, it must be done via transmutes. Also, just, don't.

Zulip discussion: https://rust-lang.zulipchat.com/#narrow/stream/136281-t-lang.2Fwg-unsafe-code-guidelines/topic/str.3A.3Afrom_utf8_unchecked.20Safety.20requirement

Specifically, make it clear that it is immediately UB to pass ill-formed UTF-8 into the function. The previous wording left space to interpret that the UB only occurred when calling another function, which "assumes that `&str`s are valid UTF-8."

This does not change whether str being UTF-8 is a safety or a validity invariant. (As per previous discussion, it is a safety invariant, not a validity invariant.) It just makes it clear that valid UTF-8 is a precondition of str::from_utf8_unchecked, and that emitting an Abstract Machine fault (e.g. UB or a sanitizer error) on invalid UTF-8 is a valid thing to do.

If user code wants to create an unsafe `&str` pointing to ill-formed UTF-8, it must be done via transmutes. Also, just, don't.
@rust-highfive
Copy link
Collaborator

r? @kennytm

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Apr 10, 2022
@Dylan-DPC
Copy link
Member

@bors r+ rollup

@bors
Copy link
Contributor

bors commented Apr 11, 2022

📌 Commit b92cd1a has been approved by Dylan-DPC

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Apr 11, 2022
Dylan-DPC added a commit to Dylan-DPC/rust that referenced this pull request Apr 11, 2022
Clarify str::from_utf8_unchecked's invariants

Specifically, make it clear that it is immediately UB to pass ill-formed UTF-8 into the function. The previous wording left space to interpret that the UB only occurred when calling another function, which "assumes that `&str`s are valid UTF-8."

This does not change whether str being UTF-8 is a safety or a validity invariant. (As per previous discussion, it is a safety invariant, not a validity invariant.) It just makes it clear that valid UTF-8 is a precondition of str::from_utf8_unchecked, and that emitting an Abstract Machine fault (e.g. UB or a sanitizer error) on invalid UTF-8 is a valid thing to do.

If user code wants to create an unsafe `&str` pointing to ill-formed UTF-8, it must be done via transmutes. Also, just, don't.

Zulip discussion: https://rust-lang.zulipchat.com/#narrow/stream/136281-t-lang.2Fwg-unsafe-code-guidelines/topic/str.3A.3Afrom_utf8_unchecked.20Safety.20requirement
Dylan-DPC added a commit to Dylan-DPC/rust that referenced this pull request Apr 11, 2022
Clarify str::from_utf8_unchecked's invariants

Specifically, make it clear that it is immediately UB to pass ill-formed UTF-8 into the function. The previous wording left space to interpret that the UB only occurred when calling another function, which "assumes that `&str`s are valid UTF-8."

This does not change whether str being UTF-8 is a safety or a validity invariant. (As per previous discussion, it is a safety invariant, not a validity invariant.) It just makes it clear that valid UTF-8 is a precondition of str::from_utf8_unchecked, and that emitting an Abstract Machine fault (e.g. UB or a sanitizer error) on invalid UTF-8 is a valid thing to do.

If user code wants to create an unsafe `&str` pointing to ill-formed UTF-8, it must be done via transmutes. Also, just, don't.

Zulip discussion: https://rust-lang.zulipchat.com/#narrow/stream/136281-t-lang.2Fwg-unsafe-code-guidelines/topic/str.3A.3Afrom_utf8_unchecked.20Safety.20requirement
This was referenced Apr 11, 2022
bors added a commit to rust-lang-ci/rust that referenced this pull request Apr 11, 2022
Rollup of 7 pull requests

Successful merges:

 - rust-lang#95008 ([`let_chains`] Forbid `let` inside parentheses)
 - rust-lang#95801 (Replace RwLock by a futex based one on Linux)
 - rust-lang#95864 (Fix miscompilation of inline assembly with outputs in cases where we emit an invoke instead of call instruction.)
 - rust-lang#95894 (Fix formatting error in pin.rs docs)
 - rust-lang#95895 (Clarify str::from_utf8_unchecked's invariants)
 - rust-lang#95901 (Remove duplicate aliases for `check codegen_{cranelift,gcc}` and fix `build codegen_gcc`)
 - rust-lang#95927 (CI: do not compile libcore twice when performing LLVM PGO)

Failed merges:

r? `@ghost`
`@rustbot` modify labels: rollup
@bors bors merged commit ae6f75a into rust-lang:master Apr 12, 2022
@rustbot rustbot added this to the 1.62.0 milestone Apr 12, 2022
/// This function is unsafe because it does not check that the bytes passed to
/// it are valid UTF-8. If this constraint is violated, undefined behavior
/// results, as the rest of Rust assumes that [`&str`]s are valid UTF-8.
///
/// [`&str`]: str
/// The bytes passed in must be valid UTF-8.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't exactly right: A &str is allowed to contain invalid utf-8, but other functions might assume that a &str is valid utf-8, making a non-utf-8 &str very hard to safely use. But just calling this function with invalid utf-8 is, by itself, not unsafe.

See also e.g. https://doc.rust-lang.org/stable/std/string/struct.String.html#method.from_utf8_unchecked

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is true, but this is perhaps even handled just by if the validity of &T is independent from whether the pointee bytes are valid at type T.

In general, from_unchecked functions do just say that it is UB to provide an argument which does not satisfy the safety invariant. This is useful, because it allows a sanitizing implementation of the function which checks the precondition.

A postcondition of "is not used in any way which causes UB" is much more difficult to reason about.

Another point is that str has the option of using &*(v as *const [u8] as *const str) to construct a &str to invalid-UTF-8. String doesn't have any such API, relying on conversion to/from Vec<u8>.

If any str/String methods actually documented that they were safe to call with a reference to invalid UTF-8, then this weaker documentation requirement makes sense. As is, the only possible thing to do with a str/String to invalid UTF-8 is to forget it. With that the case, the clearer precondition seems better to use.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants