
Removing allocations is technically unsound in OOM (out-of-memory) situations #328

RalfJung opened this issue Apr 14, 2022 · 88 comments


@RalfJung
Member

The following program might be considered guaranteed to run out of memory, but LLVM actually compiles it in a way that it does not:

int *x = malloc(SIZE_MAX);
int *y = malloc(SIZE_MAX);
if (x && y) {
  printf("This is unreachable. I could deref a NULL pointer and this program would still be fine.");
}

It is not clear how to formally justify this.

@JakobDegen also has an example for recursion exhausting memory via stack allocations and being similarly optimized away.

@JakobDegen
Contributor

JakobDegen commented Apr 14, 2022

I've since improved upon the example:

fn bad(limit: usize) {
    if limit == 0 {
        return;
    }
    let mut x: usize = limit;
    let r = &mut x;
    bad(limit - 1);
    if *r != limit {
        panic!();
    }
    return;
}

fn main() {
    bad(usize::MAX);
    println!("unreachable");
}

The interesting things about it are that 1) it never does anything to inspect the values of the pointers and 2) all of the values in the locals are genuinely live when the recursion bottoms out, so no amount of liveness analysis will get us out of this.

@JakobDegen
Contributor

JakobDegen commented Apr 14, 2022

My original idea for a way to solve this was to not require that addresses of memory in different allocations be unique in general, but to magically guarantee that this is indeed the case for any two allocations that have the property of both being live and having at least one exposed (via ptr.expose_addr()) item in their borrow stacks. That runs into this cursed example though:

let p1 = alloca(8);
let p2 = alloca(8);

if abs_diff(p1.addr(), p2.addr()) < 8 {
    // Dead code, but removing it is unsound
    p1.expose_addr();
    p2.expose_addr();
    println!("Unreachable");
}

if abs_diff(p1.addr(), p2.addr()) < 8 {
    println!("Unreachable");
}

dealloc(p1);
dealloc(p2);

This means that even if we are willing to give up unconditional address uniqueness (which I think we have to, given the example above), it is still entirely unclear how to make this work.

@JakobDegen
Contributor

I've written up a proposal for a fix here: https://hackmd.io/@2S4Crel_Q9OwC_vamlwXmw/HkQaMnB49

@RalfJung
Member Author

At a high level, it seems like that proposal moves the OOM to when expose_addr is done? That's when OOM errors occur, and hence removing an expose_addr now becomes impossible for the same reason that removing OOM-causing operations was impossible previously.

So, doesn't this example still have the same problem?

let x = malloc(isize::MAX);
let y = malloc(isize::MAX);
let z = malloc(isize::MAX);

if x.expose_addr() != 0 && y.expose_addr() != 0 && z.expose_addr() != 0 {
  unreachable!();
}

free(x);
free(y);
free(z);

It should be possible to remove this expose_addr, since we can see the allocation being removed, so no valid code can ever pick this provenance in a future from_exposed_addr.

@RalfJung
Member Author

RalfJung commented Apr 17, 2022

(After talking with @JakobDegen on Zulip.)

Indeed this doesn't entirely solve the problem of removing dead allocations -- one can still not remove dead allocations that are being exposed. However, that sounds like a big step forwards! Most allocations are not being exposed so removing them can be justified by this model. (And if we keep something like AllocId, we don't need many different "domains" either, just a single per-allocation boolean indicating whether it is part of the "bottom" domain.)

Basically, if we can carry the AllocId around reliably, we don't need "non-overlapping addresses" to figure out which allocation an access is intended to access.

The downside is that expose_addr becomes even more of an optimization barrier. Previously there were some conditions where it could be removed, that seems a lot harder now.

What remains somewhat fuzzy to me is that the address gets picked when the allocation is created, but if it gets exposed then that address may later fail to be disjoint and the program is doomed for OOM. Basically, it seems impossible for a programmer to prove, under this model, that their program does not go OOM. However, these kinds of models usually have to allow any allocation to non-deterministically OOM anyway, so... that doesn't really make things any worse, I think? Certainly some use-cases do need to prove absence of OOM, but they need a different model to begin with. I wonder if there are good ways to relate the two...


Now I wonder: what goes wrong if we just remove the "addresses must be disjoint" part entirely, without replacing it by some condition on expose_addr like @JakobDegen proposed? In from_exposed_addr we just non-deterministically guess which allocation is meant, as before. Shouldn't that still work?

@JakobDegen
Contributor

JakobDegen commented Apr 17, 2022

My third example was intended to show that this is potentially problematic:

// May these get overlapping addresses?
let p = alloca(8); *p = 0;
let q = alloca(8); *q = 0;

let a = p.expose_addr();
q.expose_addr();

let x = a as *mut usize;
*x = 5;
dbg!((*p, *q));

Legal behaviors for this program that users definitely expect would seem to be for it to OOM, or for it to print (5, 0). I believe your suggestion also allows it to print (0, 5), which seems extremely surprising (or even (0, 0), if there is any other exposed pointer anywhere else in the AM).

@RalfJung
Member Author

That's a fair point. Basically this boils down to: the angelic choice that is magically made should only be allowed to affect UB, not observable behavior.

@RalfJung
Member Author

I think I just realized there is a problem with this scheme: it breaks code like PartialEq for slices that has a fast-path for when both slices have the same address and length. Basically that would have to use expose_addr rather than addr, which (a) nobody will get right, and (b) makes optimizations hard, somewhat defeating the point of the fast-path.
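For concreteness, here is a minimal sketch (not the actual core library code, which differs in detail) of the kind of fast path this refers to. Under the proposed model, the `as_ptr` comparison below would have to become an `expose_addr` comparison to stay meaningful:

```rust
// Hypothetical sketch of a slice-equality fast path; `slice_eq` is an
// invented name, not the real `PartialEq for [T]` implementation.
fn slice_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    // Fast path: same base address and same length means the slices are
    // trivially equal, so we can skip the element-wise comparison.
    if a.as_ptr() == b.as_ptr() {
        return true;
    }
    a.iter().zip(b.iter()).all(|(x, y)| x == y)
}
```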

@JakobDegen
Contributor

I think there are even more problems than that. For example, how does this work with FFI? We don't guarantee that this code

extern "C" { fn foo(a: &u8, b: &u8); }

let x = 0;
let y = 0;
foo(&x, &y);

will pass distinct pointers to FFI. This was not a concern with other situations, since in those cases the compiler had to assume that FFI code does the worst crimes it possibly could, but here the compiler has no such concerns. This can be fixed by saying something akin to: On FFI calls, we find all the pointers reachable from the arguments/globals and .expose_addr() them; this ensures that OOMs happen before the call, and so if the call succeeds everything has disjoint allocations.

This is ugly though, and I would absolutely love to have a model that does it better. I just don't know of one.

@RalfJung
Member Author

I don't see why things would be any different for an FFI function than for a Rust function.

@JakobDegen
Contributor

Oh, yeah, nevermind. This probably doesn't work then, which means we probably have to scrap a lot of this model. Really I think the most important thing we need is some way to address the annoying circular reasoning you can do in Example 2. I don't know how to go about it though.

@JakobDegen
Contributor

I thought about this some more, and can make the issue somewhat more concrete. For slightly weird reasons, I do think this is actually an issue specifically with external functions (such as FFI). Consider

fn bar() {
    foo(&1);
}

#[inline(never)]
fn foo(a: &i32) {
    a.expose_addr();
}

A naive implementation of the Rust AM under my model would require us to include a check in foo to ensure that the address of a does not overlap with any other exposed allocation. Obviously this is borderline impossible to implement in practice. However, a compiler could be smarter. It could instead do things in two stages:

  1. In the first stage (roughly resembling MIR) that check is implicit as a part of the .expose_addr(). We don't worry about how to implement this in actual hardware, we just pretend we can. This allows us to do all the optimizations that are enabled by this memory model.
  2. In the second stage (roughly corresponding to asm) that check is no longer implicit. However, now our IR instead includes a promise that all allocations are created in the exposed address domain. This means that when we lower to this IR, we no longer have any need to include the check - it would come back false anyway. Of course, by doing this, we now once more have only finite memory and in general removing allocations at this stage is unsound - but that's ok, we've already gotten a chance to do our optimizations.

This kind of model almost works, except when you go and make foo pub. Now you're in trouble, because you can't expect that whatever will call foo, which may be arbitrary Rust code, will uphold the expectation of the lowered IR. In other words, when all the code we cared about was still under the control of this compilation session, our compiler could do smart tricks to avoid having to implement the check. Now that that is no longer the case, we're in trouble.

I don't see a way to fix this. As I said above, at this point I'm mostly just trying to think of a brand new idea.

@digama0

digama0 commented Apr 24, 2022

I think this is why I was saying that it needs to be a compiler obligation that the OOM case of expose_addr is never actually codegenned. In the case of foo, I believe this means that we would need to mark the function as potentially-exposing (or actually-exposing if the body is visible), and in general for FFI calls we can't know whether addresses are exposed and must assume that they are.

We can therefore lift the exposure into bar, i.e. replace it with let x = &1; x.expose_addr(); foo(x). This is pessimistic since we might be prematurely exposing x and also the exposure could be conditional in foo if it is opaque, but since we are permitted to spuriously OOM, this is a legal transformation.

You can therefore optimize bar using the same two-stage approach - it's just that all calls to previously optimized and opaque functions are treated as exposing everything that is reachable. In short, I don't think FFI causes any additional issues.
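The lifting described above can be sketched as follows. This is purely illustrative: `as usize` casts stand in for `expose_addr` (an integer cast is an exposing operation), since the strict-provenance method names discussed in this thread are unstable/hypothetical:

```rust
// The callee exposes its argument's address; the `as usize` cast stands
// in for `expose_addr`.
fn foo(a: &i32) -> usize {
    a as *const i32 as usize
}

// Before the transformation: exposure happens only inside `foo`.
fn bar_before() -> usize {
    foo(&1)
}

// After: the caller pessimistically exposes first. Spurious OOM at this
// point is permitted by the model, so the transformation is legal.
fn bar_after() -> usize {
    let x = &1;
    let _exposed = x as *const i32 as usize; // exposure lifted into the caller
    foo(x)
}
```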

@JakobDegen
Contributor

You could certainly have such a requirement for your compiler, but I don't see how we can in general enforce this requirement on other compilers - if we try and write a rule that says they must obey it, then we essentially once more end up with this being a part of the AM, which has problems as we've seen

@digama0

digama0 commented Apr 24, 2022

I'm not sure I follow. The model has all code which is linked together cooperating under the rules of the Rust AM, which is perhaps a little self-centered but should be fine. Different compilers communicating across FFI just need to make pessimistic assumptions about the other side - in this case, that FFI calls require all allocations to already be exposed when they are passed in, meaning that the caller should call expose_addr if necessary first. These calls are essentially fictional, in that they will be optimized away during your stage-2 compilation. For compilers that assume the allocator always gives disjoint ranges, they are already in stage-2 from the get-go so there is no special handling needed.

@JakobDegen
Contributor

Different compilers communicating across FFI just need to make pessimistic assumptions about the other side

In what sense is this a pessimistic assumption? Of course the compiler has to assume that the other side may emit .expose_addr(), but under this model that is not its problem. Now you suggest adding a rule that says that the OOM on .expose_addr() must never happen, but making this promise to the user is imo not an option.

Now if we instead make a rule that says "you must .expose_addr() before FFI calls on everything reachable from that call" we do turn the .expose_addr() into our responsibility and prevent the other side from having to possibly codegen the OOM on .expose_addr(). The problem with this rule is this though: When you say "FFI," I assume you mean "across compilation units," since that is as far as the compiler can see. Unfortunately, this means that if we inline any code across compilation units (which rustc does today even in non-xLTO situations) we now have to add an .expose_addr() in the middle, which is obviously bad.

@digama0

digama0 commented Apr 24, 2022

I mean the second thing: "you (the downstream compiler) must .expose_addr() before opaque function calls", i.e. the extern function has an implicit precondition that it receives only exposed allocations, which justifies the upstream compiler's decision to not emit any OOM calls in the function if it does call .expose_addr(). (Obviously the intent here is that this obligation gets pushed all the way up so that no one needs any OOM calls at all and we have effectively achieved what you say in the first paragraph: the compiler has ensured that expose_addr never results in an OOM call.)

If you add the expose_addr in the middle and then optimize with e.g. LTO, then yes you will have to keep the expose. If you can inline in stage-1 though you can avoid it. I think a lot of inline functions in rustc are stage-1 compilation artifacts anyway (i.e. MIR) so they would not hit this issue.

@JakobDegen
Contributor

I don't see how this works. When you talk about this sort of obligation, who is this obligation to, where is it written down, and in terms of what is it defined?

It seems like you're suggesting that this obligation is just in the language itself and is defined in the spec, but the spec has no notion of stage 1 or stage 2. The stages are a useful distinction when talking about the internal reasoning a compiler can do, I don't think we can add a rule in terms of them.

@digama0

digama0 commented Apr 24, 2022

The AM would just have a rule about overall behavior, and passes right through any FFI calls. The compiler produces a number of compilation artifacts like .rlib files, and these can have additional compiler-specific invariants on them, as long as the final linked executable satisfies the AM rules. In this case, the obligation to expose allocations on stage 2 artifacts and not on stage 1 artifacts is a property of these compiler-specific intermediate files, as part of its implementation of linking pursuant to the rules of the AM.

The linking rules are still somewhat under the control of the compiler, since it is only the linked executable that has AM dynamic semantics. So for instance the exact repr(Rust) calling convention does not need to be part of the spec, even though it is important for linking .rlibs together. The compiler needs to ensure that when it links everything together, the composite artifact does not violate the AM rules, but that does not pose any problem to inlining across compilation units if it has stage 1 compilation artifacts which contain information about whether they assume their inputs are exposed in the compiler metadata.

@JakobDegen
Contributor

I mean, this is how it works right now (more or less), but that's not really enough. For example, this would make it unsound for xLTO to remove dead locals.

More generally though, the problem is that it's not possible to implement the AM without imposing additional rules. Right now, in order for two Rust implementations to be able to link to each other, the only things they need to agree on are the AM parameters - layout, calling convention, things like that. But with this model in place, even with all AM parameters agreed upon, the side trying to compile foo is in trouble because it cannot actually implement foo with the full generality required by the AM. Instead, it has to implement a reduced version of foo and then nicely ask the thing it's linking to to please not do things that would make the non-compliant implementation observable. I don't think that's a direction we want to go in - the AM should be "trivially" implementable, without the need to impose additional global rules.

@comex

comex commented Apr 26, 2022

I don't see the problem.

At the actual assembly level, "all addresses passed across [standard-calling-convention] function call boundaries must be exposed" is naturally true. After all, all objects that still exist at the assembly level must have unique addresses that don't overlap with other simultaneously-live objects.

Meanwhile, the LLVM IR level is at least one place where the compiler currently removes unused allocations. (I think it's currently the only place where it does so.) Assuming we want the compiler to continue to be able to perform that optimization, "all addresses passed across calls must be exposed" must be false.

So why not posit that, in the current implementation, LLVM IR is stage 1 and assembly is stage 2? Then the 'additional rule' only applies to assembly, where it's automatically satisfied. And LTO is no longer a special case, since that just involves LLVM's optimizer performing the same optimizations across compilation units that it normally performs within a single compilation unit, which are presumptively valid.

@JakobDegen
Contributor

Let me try and present this a different way. Say that I am a Rust compiler looking to emit a static library corresponding to the following Rust crate:

pub extern "C" fn foo(a: &i32) {
    (a as *const i32).expose_addr();
}

Specifically, the goal is for me to emit code that has behavior identical to that of the abstract machine - that is what it means to compile Rust, after all. But that is not possible. In order to compile this, I must impose additional requirements on top of the rules associated with the language.

@digama0

digama0 commented Apr 26, 2022

Such a rust compiler could emit two versions of the function:

  • A stage 1 artifact in an internal format, like an .rlib record for an #[inline] library function. This would be something like MIR, and would include explicit information suitable for interprocedural analysis of expose_addr calls. In this case, the call would be marked as "a must be exposed" because the function calls expose_addr on a, so it would effectively be part of the (internal and unstable) calling convention that a call to foo requires that a be exposed. The code itself does nothing.
  • A stage 2 artifact because it is an extern "C" function which can be linked to e.g. C compiler output. This has a stable and documented calling convention that requires that all passed parameters are exposed. The code itself does nothing.

Compare this to the following function:

pub extern "C" fn bar(a: &i32) {
    (a as *const i32).addr();
}
  • The stage 1 artifact now contains metadata saying that a does not need to be exposed before calling the function, which forms part of the (internal and unstable) calling convention. The code itself does nothing.
  • The stage 2 artifact is unchanged: a must still be exposed before calling the function in order to conform to the stable extern "C" calling convention.

Of course, the compiler can choose to only emit stage 2 artifacts if it wants, or to not include this extra bit of interprocedural analysis data in the stage 1 artifact and always require exposure. In either case the loss is that inlining across codegen units may result in unnecessary exposure of locals.

In fact, the stage 1 artifact need not contain an explicit annotation with the exposure analysis. If it is an #[inline] function, the compiler is free to just inline the whole thing and observe that no call to expose_addr is made, regardless of the actual calling convention. (Alternatively stated, the real calling convention of such an artifact is the "weakest precondition" of the contained code, and the type signature of the function is simply a sound underapproximation of this predicate. Inlining the code is making use of the weaker precondition directly.)

@digama0

digama0 commented Apr 26, 2022

Regarding Ralf's point about PartialEq on slices, I think we already have the necessary API to do the right thing here: ptr::guaranteed_eq can be specced to always return false for addresses in different domains. That way if the slice pointers are compared using guaranteed_eq, if the compiler can prove they are from different domains it can fold it to false, if it can prove they are the same domain then it can fold it to p1.addr() == p2.addr(), and otherwise it can reduce it to p1.expose_addr() == p2.expose_addr(). In any case this is a valid refinement of the nondeterminism.
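That three-way folding can be written out as a toy model. Everything here is invented for illustration: `DomainFact` stands in for whatever the compiler can statically prove about the two pointers' exposure domains, and the function models the compiler's decision, not any real std API:

```rust
// What the compiler could prove about the two pointers' domains.
#[derive(Clone, Copy)]
enum DomainFact {
    DifferentDomains, // provably in different domains
    SameDomain,       // provably in the same domain
    Unknown,          // nothing provable
}

// Returns Some(answer) when the comparison folds at compile time, and
// None when it must be lowered to a runtime comparison of exposed
// addresses (which under the model may spuriously OOM).
fn fold_guaranteed_eq(fact: DomainFact, a: usize, b: usize) -> Option<bool> {
    match fact {
        DomainFact::DifferentDomains => Some(false),
        DomainFact::SameDomain => Some(a == b),
        DomainFact::Unknown => None,
    }
}
```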

@JakobDegen
Contributor

in order to conform to the stable extern "C" calling convention

Where is this rule coming from? If "extern "C" requires exposed pointers" is just a language rule, then we can't inline the following without adding an .expose_addr():

extern "C" fn bar(a: &i32) -> i32 {
    *a
}

fn foo() {
    bar(&2);
}

In any case, I don't think I'm making my point clear here, so I'll probably stop arguing it. Maybe someone else can explain more elegantly than me.

@digama0

digama0 commented Apr 26, 2022

(I'm using "calling convention" in a somewhat loose way to also include all the machine preconditions for soundness, not just data layout. I don't think we've miscommunicated on this point but it seems worth calling out.) Let me elaborate on the weakest precondition thing I mentioned in the previous post.

The weakest precondition of a piece of code C is the least you need to assume in order to ensure that the code does not have UB and some postcondition Q holds afterwards. This predicate is defined so as to make this property a tautology: If before running C we know property WP(C, Q) := "running C in this state will not cause UB and will result in a state satisfying Q", then of course we will not cause UB and result in a state satisfying Q. Normally we would assume something stronger and less tautological, like "these variables exist in memory at these locations and types", although compilers will generally always assume as a precondition that running the code will not lead to UB.

Relating this to the present case, if we have a non-opaque function (i.e. we have code for it), then there are essentially two ways to call it: we can actually emit a call instruction, or we can inline the code. The minimum requirement we need to satisfy the AM execution we are after is the weakest precondition of the contained code, while the construction of the function as a callable thing has ensured that the function's calling convention is stronger than this weakest precondition, so calling the function also satisfies the AM requirements as long as the calling convention is upheld in the call.

So if we find a call to an extern "C" function, then that means that we have to uphold the calling convention if we call the function, which means exposing the address. But importantly, exposing the address is not an operation that happens as a result of the call, it is only a convention between the caller and callee when they interface using an explicit call. If you don't do a call, the calling convention doesn't matter, the contained code matters, and if there isn't an expose in the inlined code then the obligation to expose the address disappears.

in order to conform to the stable extern "C" calling convention

Where is this rule coming from? If "extern "C" requires exposed pointers" is just a language rule, then we can't inline the following without adding an .expose_addr()

This is all to say that "extern "C" requires exposed pointers" is a language rule, but it is also not the case that we must add an expose_addr in that example code. We only need to expose the address if we actually call bar as an extern "C" function, not if we inline it.

@RalfJung
Member Author

RalfJung commented May 8, 2022

@digama0

Regarding Ralf's point about PartialEq on slices, I think we already have the necessary API to do the right thing here: ptr::guaranteed_eq can be specced to always return false for addresses in different domains.

I am not saying this cannot be fixed, but I think it is too much of a footgun.

@comex

Meanwhile, the LLVM IR level is at least one place where the compiler currently removes unused allocations. (I think it's currently the only place where it does so.)

Don't we ever remove unused MIR locals during MIR transformation? I would think we do. That is removing an unused allocation.

@JakobDegen

However, a compiler could be smarter. It could instead do things in two stages:

The systematic way to think about this is that the memory model changes when translating (for example) from LLVM IR to whatever the next IR is (I heard terms like SelectionDAG but I don't actually know the LLVM pipeline).

I don't quite see the FFI problem here -- when doing FFI you have to declare on which level you are doing the linking (are you linking two programs in LLVM IR, or in assembly), and the answers to various questions depend on that but it should all be coherent from what I can tell.

The most tricky bit is how exactly to formalize the condition that when you call expose_addr, the address must not conflict with any other exposed allocation. I think this needs some funky concepts like what this paper calls "no behavior" (NB), which in turn requires care while doing the proof when translating from this memory model to the one where allocations never overlap.

Basically:

  • when doing an allocation, non-deterministically (daemonically) pick an address; it can arbitrarily overlap with everything else (there also is always the option to fail with OOM, again via daemonic non-determinism)
  • in expose_addr, if this allocation overlaps with a previously exposed allocation, trigger NB

NB means it is the compiler's responsibility to prove that this never happens -- this is the exact opposite of UB. When arguing for correctness of the translation to a memory model without overlapping allocations, where usually we can assume that the source program has no UB, here we have to prove that the source program has no NB. But this is easy; in this translation we pick addresses to never overlap (we trigger OOM if they would overlap) so we are good.
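These two rules can be written as a toy executable model. Everything here is invented for illustration (`Memory`, `allocate`, `expose_addr` are not Miri or rustc code), and the daemonic address choice is modeled by letting the caller supply any address, overlap included:

```rust
#[derive(Debug, PartialEq)]
enum ExposeResult {
    Ok(usize),
    NoBehavior, // NB: the compiler's obligation is to prove this unreachable
}

struct Alloc {
    addr: usize,
    size: usize,
    exposed: bool,
}

struct Memory {
    allocs: Vec<Alloc>,
}

impl Memory {
    fn new() -> Self {
        Memory { allocs: Vec::new() }
    }

    // The address is chosen daemonically; here the caller supplies it to
    // model "any choice is possible", including one that overlaps.
    fn allocate(&mut self, addr: usize, size: usize) -> usize {
        self.allocs.push(Alloc { addr, size, exposed: false });
        self.allocs.len() - 1
    }

    // Expose triggers NB if this allocation overlaps a previously
    // exposed one; otherwise it marks the allocation exposed.
    fn expose_addr(&mut self, id: usize) -> ExposeResult {
        let (addr, size) = (self.allocs[id].addr, self.allocs[id].size);
        let overlaps = self.allocs.iter().enumerate().any(|(i, a)| {
            i != id && a.exposed && addr < a.addr + a.size && a.addr < addr + size
        });
        if overlaps {
            return ExposeResult::NoBehavior;
        }
        self.allocs[id].exposed = true;
        ExposeResult::Ok(addr)
    }
}
```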

@JakobDegen
Contributor

JakobDegen commented May 9, 2022

Don't we ever remove unused MIR locals during MIR transformation? I would think we do. That is removing an unused allocation.

We definitely do, in SimplifyLocals.

Wrt everything else, @RalfJung and @digama0: I'm still not getting my point across properly, since I agree with everything that you said but still believe there is a problem. So let me instead try getting closer to a point where I can explain what is going on by asking a question. Consider these two snippets:

Version Easy

extern "C" {
    fn foo_a(x: i32) -> i32;
}

pub fn bar_a() {
    foo_a(5);
}

Version Hard

extern "C" {
    fn foo_b(p: *mut i32);
}

pub fn bar_b() {
    let mut x = 5;
    let p: *mut i32 = &mut x as *mut _;
    foo_b(p);
}

There has already been quite a bit of discussion about FFI that asked questions of the form "what are the obligations that foo_a and foo_b must uphold?" The answer to this has been that - to the extent observable by the AM - they must have the behavior of some non-UB Rust function. That's all well and good, and I am happy with that requirement.

I'm instead talking about "what are the obligations that the Rust implementation must uphold wrt these function calls?" This is tricky, because we cannot just define this "within the AM" the way we define what let a = 5 + 10; does. After all, the extern call very much has to "leave the AM," because it interacts with non-abstract things on the hardware/OS (which we must now acknowledge exist).

How does the language around the answer to the previous question look in the spec? Below I'm going to describe a way that I imagine this could be defined, but you are free (and encouraged!) to completely ignore my description if it makes answering this question easier. To emphasize: I am by no means advocating for what I wrote below. I am only trying to explain the model that I have in my head.

How I thought this could work

One way I believe this could be defined is by defining a map from AM states to what I'll call "hardware states." The best way to explain this, I think, is by giving an example of the language that defines the behavior of the foo_a(5) function call. It might look something like this:

When evaluating a call, if the target of that function call has the type of a function declared in an extern block, then, for each argument value:

  1. We find the "location" where that value should be placed, as required by the ABI - this is either a register, or some location in hardware memory.
  2. At this location, we perform a store of the argument value, mapping AM bytes to hardware bytes as follows: Int(val) => val, PtrFragment(val, tag) => val, Uninit => unspecified.
  3. We call the extern "C" function, again as specified by the ABI.
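The byte mapping in step 2 can be sketched as a small translation function. All names here are invented for illustration; in particular, the 0xAA for uninitialized bytes is an arbitrary stand-in for "unspecified":

```rust
// A toy encoding of AM bytes: integers and pointer fragments carry a raw
// value; pointer fragments additionally carry a provenance tag.
#[derive(Clone, Copy)]
enum AmByte {
    Int(u8),
    PtrFragment(u8, u32),
    Uninit,
}

// Map an AM byte to a hardware byte per step 2: Int(val) => val,
// PtrFragment(val, tag) => val (provenance erased), Uninit => unspecified.
fn to_hardware(b: AmByte) -> u8 {
    match b {
        AmByte::Int(v) => v,
        AmByte::PtrFragment(v, _tag) => v, // the tag does not survive
        AmByte::Uninit => 0xAA, // unspecified: any value would be allowed
    }
}
```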

This is well and good for foo_a(5), but it is insufficient for foo_b(p). That's because we additionally need to guarantee that if the extern "C" function goes and dereferences the pointer it receives, it finds 5. The easiest and most aggressive way we could fix this is by adding a clause like the following:

4a. Additionally, for every currently live AM allocation, the contents of that allocation are stored to the hardware memory with the same address.

This is enough to guarantee that dereferencing the pointer has the desired behavior. Unfortunately though, it brings back the problem of "removing unused allocations is unsound," because now this is being miscompiled:

Snippet D

fn bad(limit: usize) {
    if limit == 0 {
        println!("Bottomed out"); // Pretend it's just an `extern` function
        return;
    }
    let mut x: usize = limit;
    let r = &mut x;
    bad(limit - 1);
    if *r != limit {
        panic!();
    }
    return;
}

fn main() {
    bad(usize::MAX);
    println!("unreachable");
}

The cause of this is obviously that the phrase "for every currently live AM allocation" is too broad; most of these allocations can't be accessed by the extern function after all. So we could instead use a clause like this:

4b. For every particular byte in AM memory: If there exists an execution of the extern function which could read this byte, it is written to hardware memory at the same address.

This kind of clause is what I meant above by

On FFI calls, we find all the pointers reachable from the arguments/globals and .expose_addr() them

This recovers the soundness of the optimization in Snippet D, because the allocations do not have to be spilled and so we do not have to deal with finite memory.

However, now that I've written all this out, this last version actually doesn't sound so bad. It's mostly workable - extern "Rust" is not affected as long as we don't ever give it a stable calling convention (and so can just say that the call happens in the AM, even across compilation units). The only critical point that I see with it is that with this version, removing allocations during LTO is unsound. That seems unfortunate, but maybe workable?

@RalfJung
Member Author

I assume a dead allocation remains in the allocation map but gets marked as "dead".

@sollyucko
Copy link

sollyucko commented May 23, 2022 via email

@JakobDegen
Copy link
Contributor

Running out of memory is possible and even easy in safe code, so we can't just make it UB. And even if we could, I don't think we'd want to, as it's far too easy to trigger accidentally: accidental infinite loops, which were previously just logic bugs likely to cause OOMs, would become full UB.

@Diggsey
Copy link

Diggsey commented May 23, 2022

This comment is assuming the title of this issue is still an accurate description of the problem, so if things have moved on then ignore me.

The following program might be considered to be guaranteed to run out of memory, but LLVM actually compiles it in a way that it does not:

Think about how integer overflow is handled: we guarantee that either there is a panic or the result is wrapped, but we don't specify which (at least, not at the language level).

In the same way, can't we just say that "running out of memory" will either report OOM or succeed, but not specify which. Of course the compiler is only allowed to make it succeed if all other constraints are still upheld.
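The analogy can be made concrete (a sketch illustrating the comparison, not part of any proposal): arithmetic already exposes the "report failure or succeed" split explicitly via `checked_add` and `wrapping_add`, much like a fallible allocation API reports OOM instead of invoking UB.

```rust
// Sketch of the analogy between overflow handling and OOM handling.
// `checked_add` makes the "report failure" outcome explicit, the way a
// fallible allocator reports OOM via a null/None return instead of UB.
fn main() {
    let x: u8 = 255;
    assert_eq!(x.checked_add(1), None); // overflow is reported, not UB
    assert_eq!(x.wrapping_add(1), 0);   // or the result wraps
    // The analogy: an allocation request may likewise either report OOM
    // or succeed; neither outcome is undefined behavior.
}
```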

@JakobDegen
Copy link
Contributor

JakobDegen commented May 23, 2022

I assume a dead allocation remains in the allocation map but gets marked as "dead".

Yeah, this is the version of things that I settled on as well. Allocations don't disappear when they are deallocated, they just get a switch flipped that says they're dead. At allocation time, forced allocations are only guaranteed to be non-overlapping with other live forced allocations. Importantly, .force_addr() still works after the allocation is dead. This is necessary for optimizations to ensure that

p.force_addr();
foo(); // non-diverging

can be re-ordered, as foo() might free(p).

The other minor detail then is that in the code for Option 3 above, p_alloc.forced needs to be true if p is an invalid pointer. So really, p_alloc and q_alloc should be Option<AllocId> and p_alloc.forced should read more like p_alloc.map(|alloc| alloc.forced).unwrap_or(true). But this all seems fine, and I don't think it causes problems.

We might get an equality for two different pointers with the same address but different allocations as long as both are forced... that means either they are ZST or one of them is out-of-bounds.

Note that this is exactly the same as what happens in a flat memory model.

Speaking a little more generally, the intuition I have for what we're doing is something like this: If the compiler removes the allocation at compile time, then it can pretend that the allocation is non-overlapping with all others - both for the purpose of dodging OOMs and also for the purpose of pointer comparison.

This kind of approach should generally agree with people's intuition - "if I don't ask where a pointer is, then it's non-overlapping with other pointers" vaguely makes sense and so functions like fast_eq will generally end up doing The Right Thing. Furthermore, if we imagine making a pointer be a pair of the address and something vaguely like the address domains (usize, Forced | NonForced(Natural)) then equality comparison among pointers is just structural equality and so transitivity and other properties all work out.
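The `(usize, Forced | NonForced(Natural))` pair mentioned above can be sketched directly in Rust (names hypothetical, not part of MiniRust): equality is derived structural equality, so reflexivity, symmetry, and transitivity hold by construction.

```rust
// Hypothetical sketch: a pointer modeled as (address, domain), where
// `Forced` is the single concrete address space and each `NonForced(id)`
// is a distinct abstract allocation. Derived `PartialEq` is structural,
// so the equivalence-relation properties come for free.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Domain {
    Forced,
    NonForced(u64), // per-allocation identifier
}

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
struct AmPtr {
    addr: usize,
    domain: Domain,
}

fn main() {
    let p = AmPtr { addr: 0x1000, domain: Domain::Forced };
    let q = AmPtr { addr: 0x1000, domain: Domain::Forced };
    let r = AmPtr { addr: 0x1000, domain: Domain::NonForced(7) };
    assert_eq!(p, q); // same address, both forced: equal
    assert_ne!(p, r); // same address, different domain: unequal
}
```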

@JakobDegen
Copy link
Contributor

JakobDegen commented May 23, 2022

This comment is assuming the title of this issue is still an accurate description of the problem, so if things have moved on then ignore me.

It is.

In the same way, can't we just say that "running out of memory" will either report OOM or succeed, but not specify which. Of course the compiler is only allowed to make it succeed if all other constraints are still upheld.

If one of the constraints that needs to be upheld is "allocations are non-overlapping," then the stack smashing program is a miscompilation, and that's really bad. But if it's not a miscompilation, then we somehow have to explain what it means to have overlapping allocations in the AM and... well, that's what the rest of this thread has been about.

@RalfJung
Copy link
Member Author

We might get an equality for two different pointers with the same address but different allocations as long as both are forced... that means either they are ZST or one of them is out-of-bounds.

Note that this is exactly the same as what happens in a flat memory model.

Ah, your == is stronger than what one would expect, so indeed the place to look for problems is !=. Sometimes pointers will be unequal even if they have the same address, and that is odd. It means we cannot treat ptr comparison as equivalent to comparing ptr1.addr() == ptr2.addr(), which people will probably expect to be equivalent. (I certainly did.)

OTOH the weirdness will only affect OOB pointers, which is a good sign.

If the compiler removes the allocation at compile time, then it can pretend that the allocation is non-overlapping with all others - both for the purpose of dodging OOMs and also for the purpose of pointer comparison.

Hm yeah that makes sense.

if we imagine making a pointer be a pair of the address and something vaguely like the address domains (usize, Forced | NonForced(Natural)) then equality comparison among pointers is just structural equality and so transitivity and other properties all work out.

This now relies on "forced or not" already being fixed when the allocation is created (which it is, thanks to angelic non-determinism; this is basically a prophecy variable), but -- I see, good point.

@JakobDegen
Copy link
Contributor

This now relies on "forced or not" already being fixed when the allocation is created (which it is, thanks to angelic non-determinism; this is basically a prophecy variable), but -- I see, good point.

Indeed, but this is even necessary to make pointer equality pure. It would not be great if it depended on some mutable external state (whether the forced flag is set for an allocation).

@CAD97
Copy link

CAD97 commented Jun 25, 2022

Something I stumbled across is that allocation planes might also be useful for the global allocation API boundary. It's off topic w.r.t. OOM since removing side effects of calls to #[global_allocator] is a different problem (and tbh just in the definition of #[global_allocator]'s effect), so TL;DR:

The global allocation boundary launders the #[global_allocator]'s allocation as a fresh “rust allocated object” on a new allocation plane. Being on a separate allocation plane justifies pointers being unrelated to the #[global_allocator]'s pointers — unless both objects are forced to be addressed on the concrete allocation plane, in which case you can observe that the allocated object is inside the global allocator's pool(s).

@arielb1
Copy link

arielb1 commented Jan 6, 2023

I like thinking of them as "realspace" pointers (which point to real memory) and "hyperspace" pointers.

Does this also work for #[global_allocator] - i.e., can we define the allocator function as nondeterministically either returning a hyperspace pointer or calling the allocator.

The last thing is that we need some guarantee that prevents hyperspace pointers from ending up as keys in a BTreeMap, because that would not work (stupid idea: "primitive" pointer comparison operators don't reify but <*const T as PartialOrd<*const T>> does).

For ABI calls - I like to think that when a Rust compiler picks a function, it has to nondeterministically pick between doing an "ABI" call or an "AM" call. If the compiler does an "ABI" call, it must ensure that all memory that the function can access has the right values, which probably implies that e.g. all parameters must be realspace pointers and must have a stable non-poison value. If the compiler does an "AM" call, the abstract machines on both sides of the calls must match (or you get NB), which is not going to happen if you are calling a function written in C.

I see a call as

match pick_nondeterministically() {
    DoAnAMCall => {
        if we don't know how to do that { NB() }
        do an AM call
    }
    DoAnABICall => {
        reify all memory that needs to be reified (this can ofc. NB)
        put the parameters in the right places
        call the function
    }
}

@gereeter
Copy link

gereeter commented Feb 20, 2023

Merging allocations, i.e. converting a = malloc(a_size); b = malloc(b_size) into a = malloc(a_size + b_size); b = a.offset(a_size), unfortunately seems to be no longer valid with the proposed forced-or-not semantics.

In the following program, under the proposed semantics for pointers and pointer equality, overflow is impossible. Because a and b come from different allocations, the only way that a.offset(1) can compare equal to b is if they are both forced, which would mean that every allocation gives out a different address, exhausting the address space before ever overflowing the counter. However, merging a and b into a single allocation with b placed immediately after a would introduce this as new behavior; nothing forces the combined allocation, so they can share addresses and never run into OOM.

let mut counter: usize = 0;
loop {
    let a = malloc(1);
    let b = malloc(1);
    if a.offset(1) != b {
        break;
    }
    counter = counter.checked_add(1).expect("managed to allocate more objects than address space");
}

That said, I think this can be fairly easily fixed, and IMO the result is a simpler model. Instead of allocations (and pointers) having a boolean forced flag, they can have an integer allocation domain, where domain 0 corresponds to forced and all other domains are non-forced. Then, pointers are just (address, domain) pairs, compared structurally, and overlaps are forbidden within any one domain (not just in domain 0). In MiniRust:

pub struct AllocDomain(Int);

struct Allocation {
    domain: AllocDomain,
    // ...
}

impl Memory for FixedMemory {
    type Provenance = AllocDomain /* , AllocId, Tag, etc. in Stacked or Tree Borrows */;
    fn allocate(&mut self, size: Size, align: Align) -> NdResult<Pointer<AllocDomain>> {
        // ...
        // Non-deterministically choose whether to OOM
        if pick(/* ... */, |is_oom: bool| true)? {
            return Err(/* ... */);
        }
        // Non-deterministically choose a domain for the allocation to live in
        let domain = pick(/* ... */, |domain: AllocDomain| true)?;
        let addr = pick(/* ... */, |addr: Int| {
            // ...
            if self.allocations.any(|a| a.live && a.domain == domain && a.overlaps(addr, size)) { return false; }
            // If all tests pass, we are good!
            true
        })?;
        // ...
        ret(Pointer { addr, provenance: Some(domain) })
    }
}

impl Pointer<AllocDomain> {
    fn force_addr(self) -> Nondet<()> {
        if let Some(domain) = self.provenance {
            if domain.0 != 0 {
                // No behavior
                pick(/* ... */, |_| false)?;
            }
        }
        ret(())
    }
}

All behaviors of the forced-or-not semantics are allowed in this model (by making every non-forced domain contain exactly one allocation), but not too many are added. In particular:

  • As before, dead allocations can be removed, as long as all pointer comparisons to that allocation are rewritten to false (because the dead allocation can be placed in its own domain).
  • As before, if two live pointers compare equal, they either point to the same allocation, or one is at the end of an allocation and the other at the beginning of an allocation. This justifies the optimized equality tests with a fast path for pointer equality.
  • As before, pointer equality is deterministic, stateless, side-effect-free, and an equivalence relation.

@RalfJung
Copy link
Member Author

I have not seen this idea of replacing AllocIDs by "domains" containing multiple allocations before -- interesting.

I assume you will say that when lowering to assembly, the chosen domain will be 0 for all allocations, justifying that we only compare addresses in assembly? And forcing happens on ptr2int casts?

@gereeter
Copy link

gereeter commented Feb 20, 2023

Yes, exactly.

The previous discussion suggested a few other places that might require forcing in addition to ptr2int/expose_addr, but I don't think they materially affect the proposal:

  • FFI calls already have to be assumed to do arbitrary work, including possibly forcing input pointers.
  • The Ord instance for pointers can just order lexicographically, comparing domains when addresses are equal. As with the equality case, lowering makes all domains 0, so the assembly of comparing addresses gives the correct result.
    • (Edit: There might be something weird where comparison could be used to observe that there are only finitely many domains between two pointers of equal address and different domain, but I'm not convinced that is actually a problem, and to be safe we could just make domains be some dense total order like the rationals instead.)
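The lexicographic order described above (address first, then domain when addresses tie) can be sketched with integer domains (hypothetical types, not MiniRust): Rust's `derive(Ord)` compares struct fields in declaration order, which gives exactly that order.

```rust
// Sketch: pointers ordered lexicographically, address first, then
// domain. `derive(PartialOrd, Ord)` compares fields in declaration
// order, so putting `addr` first yields the intended order.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone, Copy)]
struct AmPtr {
    addr: usize,
    domain: i64, // 0 = the concrete/forced domain
}

fn main() {
    let lo = AmPtr { addr: 0x1000, domain: 0 };
    let hi = AmPtr { addr: 0x1000, domain: 1 };
    let far = AmPtr { addr: 0x2000, domain: -5 };
    assert!(lo < hi);  // same address: the domain breaks the tie
    assert!(hi < far); // different address: the address dominates
    // After lowering, every domain becomes 0, so comparing addresses
    // in assembly agrees with this order.
}
```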

@JakobDegen
Copy link
Contributor

Instead of allocations (and pointers) having a boolean forced flag, they can have an integer allocation domain, where domain 0 corresponds to forced and all other domains are non-forced

This was the model suggested here originally. We eventually got rid of it in favor of the current model because we felt that this version was simpler.

That being said, the example at the start of your message is super interesting. I'll need to think about that some more...

@RalfJung
Copy link
Member Author

RalfJung commented Feb 21, 2023 via email

@gereeter
Copy link

gereeter commented Feb 21, 2023

This was the model suggested here originally.

Ah, good to know. I did see comments about multiple domains (that's where I got the name "domains" from), but not knowing how to see previous versions of the HackMD, I assumed it was more in line with the boolean forced-or-not version, special-casing the forced domain and such like. My bad.

There might be something weird where comparison could be used to observe that there are only finitely many domains between two pointers of equal address and different domain

Working this out, I'm now convinced that a dense total/linear order is (slightly) better. This matches my intuition that unbounded dense linear orders are particularly simple orders, and the theory of UDLOs is easily decidable, which is convenient for model checking.

Consider the following program. If domains are integers with the normal integer ordering, this cannot loop forever. If domains form a UDLO, however, it can.
let lo = malloc(1);
let hi = malloc(1);
// lo and hi have the same address but different domains
assert!(lo < hi);
assert!(lo.addr() == hi.addr());
loop {
    // mid has the same address as lo and hi and a domain in between
    let mid = malloc(1);
    assert!(lo < mid && mid < hi);
}

The optimization that removes an unused allocation is justified by choosing some domain that will never be used otherwise in the program, allowing overlap and requiring rewriting all unused == other_ptr comparisons to false. This unique domain also needs to be ordered, and the only implementable option I see is consistency, making the unique domain less than all other domains to ever appear in the program. This rewrites unused < other_ptr to unused.addr() <= other_ptr.addr() and rewrites unused > other_ptr to unused.addr() > other_ptr.addr().

Unfortunately, on the example above, this optimization introduces the possibility of an infinite loop, meaning it is invalid for integer-ordered domains. If lo is optimized out, the loop assertion becomes lo.addr() <= mid.addr() && mid < hi, and this is infinitely satisfiable by choosing smaller and smaller domains for each mid.

So for the sake of removing unused allocations in the presence of pointer comparison, I think domains should form a UDLO.

(This really is a very minor issue, though. Most pointers aren't ordered, and this optimization, even though it is incorrect under integer domains, can only introduce infinite loops where every finite prefix appears in the original program.)
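The density property that motivates rationals can be demonstrated directly (a sketch with hypothetical types): between any two distinct rational domains there is always a midpoint, so a fresh allocation can always be given a domain strictly between two existing ones — which is precisely what the integers cannot offer below a smallest element.

```rust
// Sketch: domains as rationals num/den (den > 0), an unbounded dense
// linear order. `between` returns the midpoint, witnessing density:
// a fresh domain can always be placed strictly between two others.
#[derive(Debug, Clone, Copy)]
struct Dom { num: i64, den: i64 } // invariant: den > 0

fn lt(a: Dom, b: Dom) -> bool {
    // a/b < c/d  <=>  a*d < c*b  (denominators are positive)
    a.num * b.den < b.num * a.den
}

fn between(a: Dom, b: Dom) -> Dom {
    // midpoint (a + b) / 2, without reducing the fraction
    Dom { num: a.num * b.den + b.num * a.den, den: 2 * a.den * b.den }
}

fn main() {
    let lo = Dom { num: 0, den: 1 };
    let hi = Dom { num: 1, den: 1 };
    let mut mid = between(lo, hi);
    for _ in 0..10 {
        assert!(lt(lo, mid) && lt(mid, hi));
        mid = between(lo, mid); // always a fresh domain strictly below `mid`
    }
}
```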

@gereeter
Copy link

gereeter commented Feb 21, 2023

The newly proposed model does away with per-allocation IDs entirely.

To be clear, I was still assuming that something like Tree Borrows would be used to declare out-of-bounds accesses and use-after-frees illegal, so there would be some provenance info unique to the allocation, possibly including the per-allocation ID. I was just focusing on the definition of pointer equality and the details needed for removing unused allocations.

@JakobDegen
Copy link
Contributor

JakobDegen commented Feb 28, 2023

What a wonderful example. Things like this are why I keep coming back to this repo.

To be honest, I wasn't really expecting to make pure Ord work in this kind of a model, but I don't immediately see a problem. The trade-off that I am thinking about, though, is whether it's worth it. This has the potential to surprise users, and my instinct is that pure Ord on pointers is not so important that this is the hill we need to die on. On the other hand, I also can't think of any way a user would discover this except by reading the spec, so it might not actually be an issue.

@workingjubilee
Copy link
Member

Observationally, programmers already find a lot of details about the pointer comparison semantics surprising, so if we add another gotcha but in doing so make the rules about pointers more overall coherent (and thus "more understandable, even if the first brush violates your intuitions"), I think it's a win.

@RalfJung
Copy link
Member Author

RalfJung commented Mar 19, 2023

@gereeter I don't follow which orders you are talking about -- how exactly are the various orders you are referring to defined?

the only implementable option I see is consistency, making the unique domain less than all other domains to ever appear in the program. This rewrites unused < other_ptr to unused.addr() <= other_ptr.addr()

You said earlier they should be ordered lexicographically (first domain, then address), which would make unused < other just be true if unused is in the lowest domain -- I am confused. (This requires the compiler to ensure that other is not pointing to the unused allocation.)

Do compilers even remove allocations if their addresses are compared with other pointers?

// mid has the same address as lo and hi and a domain in between

The code doesn't check for 'same address' though?

To me it seems like the source of trouble here is that you assumed there was a "lowest" domain, but then also assumed the domains are integers, and obviously there is no smallest integer, so that is just not coherent. I would have expected the optimization to simply not kick in if there are any such comparisons; that means the compiler can choose any domain.

@saethlin
Copy link
Member

saethlin commented Mar 19, 2023

Do compilers even remove allocations if their addresses are compared with other pointers?

Yes? https://godbolt.org/z/zfzPhMPfo

pub fn uwu(x: *const u8) -> bool {
    let y = 1u8;
    if x == &y as *const u8 {
        true
    } else {
        false
    }
}
define noundef zeroext i1 @_ZN7example3uwu17h0b9a63f97a9ab0f7E(ptr nocapture noundef readnone %x) unnamed_addr #0 !dbg !6 {
  ret i1 false, !dbg !11
}

For completeness, Box does not get the same optimization: https://godbolt.org/z/efnP7s4jK

@RalfJung
Copy link
Member Author

With < it still seems to actually do a comparison though, so this does not suffice for the example above.

@gereeter
Copy link

Do compilers even remove allocations if their addresses are compared with other pointers?

Not directly, as far as I know, when ordering is involved. The integer vs. UDLO quibble is an extremely minor side note, and my justification for preferring e.g. rationals was based on a hypothetical optimization.

That said, compilers do optimize pointer order comparisons, and can indirectly remove such allocations by first optimizing away the comparison. (For example, in uwu above, replacing x == &y with x >= &y && x <= &y gives the same result.) This means that e.g. @arielb1's suggestion of having <*const T as PartialOrd<*const T>> reify/force the pointers does not work, as removing a force_addr() call is invalid. This is why I was looking at ordering in the first place: we need some definition, and I was testing the waters to see if any problems arose.

how exactly are the various orders you are referring to defined?

You said earlier they should be ordered lexicographically (first domain, then address),

No, first address, then domain. I think my wording ("comparing domains when addresses are equal") was a bit confusing.

To be extra precise, I'm considering two cases, namely pointers as (address, domain) pairs (plus borrow tags, which are irrelevant here), where the domain is an integer and pointers are compared lexicographically (first address, then domain), and pointers as (address, domain) pairs (plus borrow tags, which are irrelevant here), where the domain is e.g. a rational number and pointers are compared lexicographically (first address, then domain).

@RalfJung
Copy link
Member Author

Oh, I got the field order wrong, that explains my confusion.

replacing x == &y with x >= &y && x <= &y

Well, those two are equivalent, so this doesn't tell us a lot about the structure of <= -- but indeed it does tell us that PartialOrd doesn't force.

@RalfJung
Copy link
Member Author

RalfJung commented Apr 5, 2024

Some relevant discussion: rust-lang/miri#3450 (comment)
