Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provenance #52

Closed
asajeffrey opened this issue Nov 29, 2018 · 23 comments
Closed

Provenance #52

asajeffrey opened this issue Nov 29, 2018 · 23 comments
Labels
A-provenance Topic: Related to when which values have which provenance (but not which alias restrictions follow)

Comments

@asajeffrey
Copy link

Do we need to discuss C's provenance model?

@asajeffrey
Copy link
Author

In particular, are we OK with having code which has defined behaviour in Rust but is UB in C?

For example, in C IIRC, foo() is UB:

  fn foo() {
    let x = false;
    let y = true;
    if (&x as usize + 1) == (&y as usize) {
      let z: &[bool; 2] = unsafe { &*(&x as *const bool as *const [bool;2]); }
      bar(z);
    }
  }

If Rust doesn't track provenance, then it'll be safe in Rust (possibly unless bar is a C FFI function).

@RalfJung
Copy link
Member

RalfJung commented Nov 29, 2018

are we OK with having code which has defined behaviour in Rust but is UB in C?

Yes! C has type-based alias analysis, we do not.


However, I think we will have provenance. I tried defining a model for Rust's references without provenance last year, and it wasn't very compatible with how unsafe code treats references. Moreover, I have since then come to believe that having a model where lifetimes are irrelevant for whether a program has UB is even more important, and I do not know how to define a model that both ignores lifetimes and has no provenance.

That said, I strongly believe that pointers should be the only values that have provenance. In particular, integers do not have provenance. (And no, pointers are not just integers.)

I think foo is UB in Rust, and it would be UB in even the "least-provenance" model I can imagine. The fact that you cannot do pointer arithmetic from one allocation to another is fairly fundamental, and we'd sacrifice any hope of ever being compatible with LLVM if we decided anything else. We'd also probably sacrifice a lot of performance, though unfortunately I know of no analysis that shows how much performance which of C's restrictions on pointers actually gains.
If you go through an integer, things are different -- and you quickly run into LLVM bugs.

@asajeffrey
Copy link
Author

@RalfJung do raw pointers carry provenance, or just references? (So, is ref-to-raw-ptr-to-ref a no-op, or is raw-ptr-to-int-to-raw-ptr?)

@hanna-kruppe
Copy link

We'd also probably sacrifice a lot of performance, though unfortunately I know of no analysis that shows how much performance which of C's restrictions on pointers actually gains.

It is always difficult to fairly to evaluate how much optimization power a given UB actually enables and which concrete instances of optimizations could be salvage in specific circumstances by a different (often more involved) analysis, especially when you put no upper bound on the amount of engineering you're willing to invest into that. However, this particular UB is so fundamental to alias analysis of C-like languages that I have very little hope of ever encountering a real world optimizer that doesn't have this assumption (or an even stronger one, if it doesn't support C in the first place) hard-coded. Even when there's some escape hatch, optimizations will not be built to handle it as gracefully as possible, so even independent of LLVM, trying to allow cute little programs like this will effectively kneecap any optimizers.

So there needs to be really, really, really good rationale for allowing this sort of shenanigans. By my current knowledge, even allowing reading of uninitalized variables has better justification (in that there's at least a genuinely-novel and potentially-useful data structure that relies on it).

@RalfJung
Copy link
Member

@asajeffrey That's a hard question. Raw pointers certainly carry some provenance about which allocation they point to, i.e., your example is UB even with raw pointers. Beyond that, Stacked Borrows currently does not assign them any provenance, but that is known to not match LLVM's noalias. The path forward here is not clear yet.

@asajeffrey
Copy link
Author

Is there a way to use the stacked borrows model to make pointers just integers by keeping the provenance information in the memory rather than the pointer value, then on converting a *T to a &T checking that there is an appropriate borrow, and that the T value being pointed to all has the same provenance? This would deal with the example from the blog post:

  let mut x = false;
  let mut y = false;
  if (&x as usize + 1) == (&y as usize) {
    let z: *mut bool = (&mut x as *mut bool as usize + 1) as *mut bool;
    let z: &mut bool = unsafe { &mut *z };
    *z = true;
  }
  y // can be optimized to false

since there's an implicit freezing of y, so the unsafe block is UB since there is no "appropriate" borrow of &mut y.

Er, lots of hand-waving there about what an "appropriate" borrow is.

@RalfJung
Copy link
Member

Stacked Borrows inherently makes it so that two pointers to the same location are not equivalent to use:

let x = &mut 0;
let y = &mut *(x as *mut _);
*y = 4; // allowed
*x = 2; // allowed, but invalidates y
*y = 1; // NOT ALLOWED

A model without provenance inherently cannot distinguish x and y. Thus, I think the fundamental goal of Stacked Borrows to be a dynamic version of the borrow checker cannot be achieved without provenance.

@asajeffrey
Copy link
Author

Hmm, so values of type &T have to carry some metadata about lifetimes, Do they need provenance as well though? Could we have a model where usize and *T are just bitstrings, and &T contains additional lifetime metadata?

@asajeffrey
Copy link
Author

A variant example:

  let mut p = (false, false);
  let x = &mut p.0;
  let z = (&x as *mut bool as usize + 1) as *mut bool;
  let z = unsafe { &mut *z };
  *z = true;
  &p.1; // can be optimized to false???

I think this optimization is allowed, because I think this program is UB.

@RalfJung
Copy link
Member

I don't understand the distinction between provenance and "lifetime metadata". Also, people transmute between references, pointers and usize and expect that to work.

What happens, basically, is that we have two kinds of values: bitstrings, and pointers. Pointers carry provenance, meaning there are pointers that "point to the same thing" and yet are different. This is part of the abstract machine on which Rust runs and completely independent of any type system. We can then ask which effects various operations such as casting and loading from/storing to memory can have on each of these values. This is a long and complicated discussion that you can write entire papers about and still be far from done.

In Stacked Borrows, I am trying to restrict provenance in raw pointers to a minimum. For example, casting ref-to-raw will "strip" the tag that Stacked Borrows uses to identify where a reference comes from. (However, it will still track which allocation the pointer points to. I see no possibility of allowing cross-allocation reasoning even for raw pointers. I also see very little motivation for that.) However, you can still transmute a reference to a raw pointer and then use that, and we have to say what happens (and people don't like it when we make that UB).

@RalfJung
Copy link
Member

I think this optimization is allowed, because I think this program is UB.

I agree, and so does miri (select "Tools - Miri" to see it report this as UB).

@asajeffrey
Copy link
Author

@RalfJung ah, we may be using the word "provenance" differently. I'm meaning the C defn. As far as I'm aware the best discussion of it is https://www.cl.cam.ac.uk/~km569/exploring_provenance.pdf

When you say "pointers" you are meaning references and raw pointers? You are arguing that raw pointers need to carry metadata?

@RalfJung
Copy link
Member

RalfJung commented Dec 2, 2018

we may be using the word "provenance" differently.

I don't think we do.

When you say "pointers" you are meaning references and raw pointers?

I was talking about values in the abstract machines. At this point, there are no types. Just like the bitstring 0xE6 might be u8 or i8 or one byte of some other type, a pointer value can appear in variables of various types.

The odd thing to get used to here is that a bit in this machine cannot be just 0 or 1, but also uninitialized or "the n-th bit of some abstract pointer value". See this blog post for further details on that stuff (that post handles things byte-per-byte instead of bit-per-bit, but that's not a fundamental difference).

We can then ask which values of the abstract machine can occur for variables of which types. For example, we would like 0x2 to never occur for type bool. That's the "validity invariant" -- which values of the abstract machine are valid for which type.

So when I say "pointers" I mean all pointer values, no matter the type they occur at.

You are arguing that raw pointers need to carry metadata?

Yes. They need to carry the allocation they refer to, at the least (to rule out cross-allocation arithmetic). This is well-established even in Rust.

@Gankra
Copy link

Gankra commented Feb 4, 2019

Not 100% sure if this is the right issue, but wanted to toss this out here:

CHERI reported compatibility issues related to provenance when porting C software to its platform. (See section 5.3)

We have largely ignored CHERI in the design of the rust platform on the assumption that it's vaporware, but it seems that ARM has just announced that they are implementing native CHERI support.

Note also that CHERI requires size_t and intptr_t be different (as discussed in the paper), as intptr_t is 128-bit and size_t is still 64-bit (iiuc). Also intptr_t needs to be codegen'd as if it is a pointer in order to maintain the hardware-level integrity of the tag bits it stores.

@Amanieu
Copy link
Member

Amanieu commented Feb 4, 2019

@gankro

I suspect that if ARM does add CHERI support, they will do it without changing the size of a pointer, instead using something similar to their memory tagging extension which uses the top 8 bits of a pointer that are ignored by the hardware.

Another thing to look at is hwasan which also uses the top 8 bits of a pointer on AArch64 to implement a faster version of ASAN. There might be implications for code generation since the pointer tags will need to be preserved.

@RalfJung
Copy link
Member

FWIW, the glossary now contains a definition of provenance.

@RalfJung RalfJung added A-provenance Topic: Related to when which values have which provenance (but not which alias restrictions follow) and removed T-aliasing labels Aug 14, 2019
@jrtc27
Copy link

jrtc27 commented Mar 10, 2021

@gankro

I suspect that if ARM does add CHERI support, they will do it without changing the size of a pointer, instead using something similar to their memory tagging extension which uses the top 8 bits of a pointer that are ignored by the hardware.

That is not, and never can be CHERI. 8 bits is nowhere near enough to encode all of:

  • fine-grained bounds
  • fine-grained permissions
  • sealing, a form of opaque reference for constrained domain transitions, critical for interaction between compartments

CHERI is always going to be using capabilities that are two times the native word size (there was an uncompressed 256-bit version originally, but that is clearly not practical). Fitting everything into 128 bits is already a significant technical accomplishment. Plus MTE potentially has other uses in a world with CHERI in order to accelerate temporal safety schemes.

@nico-abram
Copy link

I found a C WG14 draft N2676 that discusses provenance (And N2624 "Introduction for discussion of N2577" (N2577 is N2676 without a small patch) ). I'm sharing it here since it seems relevant/interesting prior art. At one point, to support these desireable behaviours:
imagen
It defines some possible models:
imagen

Is the "copying pointer values with memcpy" usecase similar/equivalent to transmuting pointers/references?

@jrtc27
Copy link

jrtc27 commented Aug 10, 2021

A couple of notes:

  1. In CHERI C, much of that becomes unnecessary; intptr_t is still a real capability type, not a plain integer (though we let you manipulate it as if it were one, with certain constraints if it is an actual valid capability). The need for some of the changes to C stem from intptr_t being typedef'ed to a plain integer type so you have much weaker guarantees about what software could be doing with them, whereas in our model the provenance is tracked in the hardware and maintained throughout and we know that any cast to a plain integer type will not result in a valid pointer when cast back. This does render several of the examples broken by design, but supporting them would require a lot of complexity for code that should not exist (e.g. if you printf then scanf a pointer you will just get back the integer address and be unable to dereference it).
  2. The "copying pointer values with memcpy" is not related; that is just to say that:
struct foo {
    void *p;
};
void copy_foos(struct foo *dst, const struct foo *src, size_t n) {
    memcpy(dst, src, n * sizeof(struct foo));
}

had better work and behave as if you had assigned rather than memcpy'd (see also the realloc example, which is basically the same thing, except that the memcpy is an implicit part of the realloc interface when it has to move your allocation).

@CAD97
Copy link

CAD97 commented Mar 20, 2022

I have no idea where to properly note this, but I wanted to note it down:

There's at least some desire for f64 to be able to hold provenance information, so that NaN-boxing can potentially be seen through by the optimizer.

Personally, my opinion is just "yikes." By this time I'm firmly on the side that traditional ptrtoint needs to leak provenance such that act of god divining the address elsewhere in the process is valid. (But hopefully there's a way for the compiler to ptrtoint without leaking provenance, so you can e.g. display an address without considering the provenance leaked.)

@RalfJung
Copy link
Member

my opinion is just "yikes."

My opinion is "no way, if you do crazy things like that you get to keep the pieces".^^

It'd be much more sane to use something like

union MyJavaScriptThing { float: f64, ptr: *mut () }

and then use provenance-preserving operations to manipulate the bit pattern of ptr such that it is a NaN.

@thomcc
Copy link
Member

thomcc commented Mar 24, 2022

You can implement Nanboxing using a *mut () (or NonNull<()> if your encoding avoids all-bits-zero) on 64 bit. On 32 bit you'd need the union approach. This is true for most nanboxing schemes (and I've never seen one that actually passed f64 around, they almost always either use an integer (possibly wrapped in a struct), or a union).

This is one of the cases I'm thinking of when I've commented that ptr->int is often used to avoid duplicate code though.

@JakobDegen
Copy link
Contributor

Closing, provenance exists and the questions here are answered

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-provenance Topic: Related to when which values have which provenance (but not which alias restrictions follow)
Projects
None yet
Development

No branches or pull requests

10 participants