
Add simple data-race detector #1617

Merged · 17 commits · Nov 29, 2020
Conversation

@JCTyblaidd (Contributor) commented Nov 2, 2020

Partially fixes data-race detection (see #1372), based on Dynamic Race Detection for C++11.

  • This does not explore weak memory behaviour; only one sequentially consistent ordering is explored.
  • Data-race detection is only enabled after the first thread is created, so it should have minimal overhead for non-concurrent execution.
  • Does not attempt to re-use thread IDs, so creating and joining threads many times over an execution will result in the vector clocks growing in size and slowing down program execution. (Edit: it does now. A minimal vector-clock sketch follows below.)
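
For illustration, here is a minimal vector-clock sketch in Rust (an invented example, not Miri's actual implementation): each thread keeps one counter per known thread, a local event ticks the thread's own slot, and a join takes the per-slot maximum to establish happens-before.

#[derive(Clone, Debug, PartialEq)]
struct VClock(Vec<u32>);

impl VClock {
    fn new(num_threads: usize) -> Self {
        VClock(vec![0; num_threads])
    }
    // Record a local event on thread `t`.
    fn tick(&mut self, t: usize) {
        self.0[t] += 1;
    }
    // On join/acquire: absorb everything the other clock has seen.
    fn join(&mut self, other: &VClock) {
        for (a, b) in self.0.iter_mut().zip(&other.0) {
            *a = (*a).max(*b);
        }
    }
    // `self` happens-before `other` if every component is <=.
    fn happens_before(&self, other: &VClock) -> bool {
        self.0.iter().zip(&other.0).all(|(a, b)| a <= b)
    }
}

fn main() {
    let mut t0 = VClock::new(2);
    let mut t1 = VClock::new(2);
    t0.tick(0);                      // write on thread 0
    t1.join(&t0);                    // thread 1 synchronizes with thread 0
    t1.tick(1);                      // later write on thread 1
    assert!(t0.happens_before(&t1)); // ordered, so no race is reported
}

Note how the clock vector grows with the number of thread slots, which is exactly why re-using the indices of terminated threads matters.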

src/data_race.rs Outdated

use smallvec::SmallVec;

use crate::*;
Suggested change
use crate::*;
use crate::MiriEvalContext;

Since only a couple of imports from crate are used below?

src/data_race.rs Outdated
Comment on lines 61 to 62
impl<'mir, 'tcx: 'mir> EvalContextExt<'mir, 'tcx> for crate::MiriEvalContext<'mir, 'tcx> {}
pub trait EvalContextExt<'mir, 'tcx: 'mir>: crate::MiriEvalContextExt<'mir, 'tcx> {

Suggested change
impl<'mir, 'tcx: 'mir> EvalContextExt<'mir, 'tcx> for crate::MiriEvalContext<'mir, 'tcx> {}
pub trait EvalContextExt<'mir, 'tcx: 'mir>: crate::MiriEvalContextExt<'mir, 'tcx> {
impl<'mir, 'tcx: 'mir> EvalContextExt<'mir, 'tcx> for MiriEvalContext<'mir, 'tcx> {}
pub trait EvalContextExt<'mir, 'tcx: 'mir>: MiriEvalContextExt<'mir, 'tcx> {

src/data_race.rs Outdated
let data_race = &*this.memory.extra.data_race;
let old = data_race.multi_threaded.get();

data_race.multi_threaded.set(false);

I believe you can use let old = data_race.multi_threaded.replace(false);.
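
(Cell::replace from the standard library swaps in the new value and returns the old one; a minimal sketch of the suggested simplification:)

use std::cell::Cell;

fn main() {
    let multi_threaded = Cell::new(true);
    // One call instead of the separate `get` followed by `set`:
    let old = multi_threaded.replace(false);
    assert!(old);
    assert!(!multi_threaded.get());
}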

src/data_race.rs Outdated
"Invalid alt (>=):\n l: {:?}\n r: {:?}",l,r
);
}
}

nit: missing trailing newline here and in other files.

@RalfJung (Member) commented Nov 2, 2020

@JCTyblaidd thanks a lot! This is coming quite out of the blue, but a welcome surprise. :)

Unfortunately I will probably not have the time during the week to review such a massive PR. I hope next weekend will work, but it might also be 2 weeks until I have a few free hours to do the initial review.

> Data-race detection is only enabled after the first thread is created, so should have minimal overhead for non-concurrent execution.

Conceivably a program could briefly create a thread for a one-off task and then continue running mostly sequentially afterwards. So I think it would still be good to do some benchmarking to measure the impact of this. Unfortunately Miri does not have an established benchmarking infrastructure; all we have are two little crates that I used for some ad-hoc benchmarking in the past. Could you measure the impact of your PR on those cases (after adding a dummy thread created at the beginning)? hyperfine would be a good way to get some reliable numbers. (There's also the benches folder, which exists from ye olde times and which I never used for anything...)

@bjorn3 (Member) commented Nov 2, 2020

> Conceivably a program could briefly create a thread for a one-off task and then continue running mostly sequentially afterwards.

Maybe it could be disabled again once all threads except the main thread have been joined.

@RalfJung (Member) commented Nov 2, 2020

Well, then there could be a background thread just sitting there waiting for some cleanup work to become necessary, but not actually doing anything. (Crossbeam has such a thread, AFAIK.)

Basically, I am worried about "there's a thread somewhere doing nothing" becoming a huge performance cliff.

@JCTyblaidd (Contributor, Author) commented

I only ran the tests for one of the benchmarks. This only adds one dangling thread, so the overhead might be reduced by staying inside the small-vector size limit. Will try to get more numbers later.


Single-threaded:

time ./miri run ./bench-cargo-miri/mse/src/main.rs -Zmiri-disable-stacked-borrows
real 0m1.268s
user 0m1.053s
sys 0m0.232s


hyperfine --warmup 2 'target/x86_64-unknown-linux-gnu/release/miri --sysroot ~/.cache/miri/HOST ./bench-cargo-miri/mse/src/main.rs -Zmiri-disable-stacked-borrows'
Benchmark #1: target/x86_64-unknown-linux-gnu/release/miri --sysroot ~/.cache/miri/HOST ./bench-cargo-miri/mse/src/main.rs -Zmiri-disable-stacked-borrows
  Time (mean ± σ):     716.6 ms ±   6.6 ms    [User: 694.5 ms, System: 20.4 ms]
  Range (min … max):   708.0 ms … 729.2 ms    10 runs

Single-threaded with overhead:

time ./miri run ./bench-cargo-miri/mse_and_dangling_thread/src/main.rs -Zmiri-disable-stacked-borrows
warning: thread support is experimental.

real 0m2.184s
user 0m1.176s
sys 0m0.283s


hyperfine --warmup 2 'target/x86_64-unknown-linux-gnu/release/miri --sysroot ~/.cache/miri/HOST ./bench-cargo-miri/mse_and_dangling_thread/src/main.rs -Zmiri-disable-stacked-borrows'
Benchmark #1: target/x86_64-unknown-linux-gnu/release/miri --sysroot ~/.cache/miri/HOST ./bench-cargo-miri/mse_and_dangling_thread/src/main.rs -Zmiri-disable-stacked-borrows
  Time (mean ± σ):     744.0 ms ±   7.1 ms    [User: 726.1 ms, System: 17.9 ms]
  Range (min … max):   736.5 ms … 759.7 ms    10 runs

Changed benchmark with variant:
fn main() {
    let thread = std::thread::spawn(|| 4);
    for _ in 0..2 {
        mse(PCM.len(), PCM, EXPECTED);
    }
    assert_eq!(4, thread.join().unwrap());
}

@RalfJung (Member) commented Nov 2, 2020

Thanks for measuring this! Is the "716.6 ms ± 6.6 ms" using your branch, or master? I am surprised that the difference is so small (doesn't this affect every single memory access?), also given that time via ./miri shows a much bigger impact.

@JCTyblaidd (Contributor, Author) commented

"716.6 ms ± 6.6 ms" is with my branch, which should is similar to current upstream master with the only overhead of an easily speculated branch that is always false to skip data-race checking, seems to be within the margin of error. This could be the benchmark being used though, if the values do not escape into memory and are kept inline? More benchmarks would probably be better but I can't get the other one to compile, it complains about importing serde_json.

Upstream master

  Time (mean ± σ):     723.2 ms ±   4.3 ms    [User: 706.9 ms, System: 16.1 ms]
  Range (min … max):   716.6 ms … 728.7 ms    10 runs

The overhead might be minimal in the above "744.0 ms ± 7.1 ms" case because (though I may be wrong):

  • values are re-used, so the range-map needs minimal updating for internal memory cells?
  • there are no atomic/join/thread-create operations during the calculation, so the timestamp remains constant and the range-map might be able to stay in the state WRITE_THREAD=0, LAST_WRITE=K, READS = {0: K} (inline) for the entire allocation, with no need to differentiate?
  • only 2 threads, so the read vector for data-races remains an inline smallvec
  • as above, values don't escape to memory?

Probably need more benchmarks to get a clearer picture.

Commit messages pushed (truncated):
  • …previously only data races between two non-atomic accesses were detected.
  • …code & add support for vector index re-use for multiple threads after termination.
  • …convert validate atomic op to use shared reference and get_raw instead of get_raw_mut so it can be used for validate_atomic_load as well
@RalfJung (Member) left a comment

This looks really great overall, thanks a lot for this impressive piece of work! My comments are mostly formatting and grammar nits, and some naming suggestions.

I did not fully understand every part of the algorithm, but I do not necessarily have to. What I saw mostly makes intuitive sense, except for release sequences which I need to look into further, and some parts of the handling of non-atomic accesses.

What I missed is anything specific to SeqCst. SeqCst induces a global ordering on all sequentially consistent operations, shouldn't that affect the data race detector? Currently, if I saw correctly, SeqCst is fully equivalent to always using the strongest available release/acquire ordering.

authors = ["Ralf Jung <post@ralfj.de>"]
edition = "2018"

[dependencies]

I don't think it is a good idea to duplicate this example code... isn't there a better way, such as adding a cfg flag or so?
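
One conceivable alternative (an assumption for illustration, not necessarily the approach taken in the PR): keep a single benchmark crate and gate the extra thread behind a custom cfg flag, passing --cfg dangling_thread via RUSTFLAGS for the second variant.

fn main() {
    // Spawned only when the benchmark is built with the flag.
    #[cfg(dangling_thread)]
    let handle = std::thread::spawn(|| 4);

    // ... the shared mse benchmark body would run here ...

    #[cfg(dangling_thread)]
    assert_eq!(4, handle.join().unwrap());
}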

src/machine.rs Outdated
@@ -109,12 +109,15 @@ impl fmt::Display for MiriMemoryKind {
pub struct AllocExtra {
/// Stacked Borrows state is only added if it is enabled.
pub stacked_borrows: Option<stacked_borrows::AllocExtra>,
/// Data race detection via the use of a vector-clock.
pub data_race: data_race::AllocExtra,

Similar to -Zmiri-disable-stacked-borrows, I think it would make sense to have -Zmiri-disable-data-race-detector (or so) which turns this field into None to ensure there is no overhead.
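
A hedged sketch of the suggested shape (type and flag names are simplified stand-ins, not the real Miri types): the detector state becomes an Option that is None when the flag is set, so disabled runs pay nothing per allocation.

pub struct StackedBorrowsExtra;
pub struct DataRaceExtra;

pub struct AllocExtra {
    pub stacked_borrows: Option<StackedBorrowsExtra>,
    // `None` under a hypothetical -Zmiri-disable-data-race-detector.
    pub data_race: Option<DataRaceExtra>,
}

impl AllocExtra {
    fn new(data_race_enabled: bool) -> Self {
        AllocExtra {
            stacked_borrows: None,
            data_race: data_race_enabled.then(|| DataRaceExtra),
        }
    }
}

fn main() {
    let extra = AllocExtra::new(false);
    assert!(extra.data_race.is_none()); // no per-allocation detector state
}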

src/machine.rs Outdated
}

/// Extra global memory data
#[derive(Clone, Debug)]
pub struct MemoryExtra {
pub stacked_borrows: Option<stacked_borrows::MemoryExtra>,
pub data_race: data_race::MemoryExtra,

This should then likely also be an Option field, like we have for -Zmiri-disable-stacked-borrows.

@@ -78,7 +78,10 @@ pub fn futex<'tcx>(
// Read an `i32` through the pointer, regardless of any wrapper types.
// It's not uncommon for `addr` to be passed as another type than `*mut i32`, such as `*const AtomicI32`.
// FIXME: this fails if `addr` is not a pointer type.
let futex_val = this.read_scalar_at_offset(addr.into(), 0, this.machine.layouts.i32)?.to_i32()?;
// FIXME: what form of atomic operation should the `futex` use to load the value?

That's a good question... aren't there any docs or so defining this?

ecx.read_scalar_at_offset(mutex_op, offset, ecx.machine.layouts.i32)
ecx.read_scalar_at_offset_atomic(
mutex_op, offset, ecx.machine.layouts.i32,
AtomicReadOp::SeqCst

Should this really be SeqCst? I would have expected Acquire. (Same for the other reads in this file I think.)
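
For reference, the distinction in terms of std atomics (a generic illustration, not the shim code itself): an acquire load already synchronizes with the matching release store, which is all a lock-field read needs; SeqCst additionally participates in a single total order over all SeqCst operations.

use std::sync::atomic::{AtomicI32, Ordering};

fn main() {
    let field = AtomicI32::new(0);
    field.store(1, Ordering::Release);     // e.g. the unlocking writer
    let v = field.load(Ordering::Acquire); // the reader sees the release
    assert_eq!(v, 1);
}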

@JCTyblaidd (Contributor, Author) commented

I've pushed a few commits that should fix the requests. As I added in a comment inside the data-race detector code, per the C++ specification on sequentially consistent ordering:

> A load operation with this memory order performs an acquire operation, a store performs a release operation, and read-modify-write performs both an acquire operation and a release operation, plus a single total order exists in which all threads observe all modifications in the same order (see Sequentially-consistent ordering below)

This means that, since we currently do not model weak memory effects and the scheduling algorithm already provides a global total order of all memory operations, seq-cst loads/stores are equivalent to acquire loads/release stores, and a seq-cst RMW is equivalent to an acquire-release RMW.
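
A sketch of that equivalence as a detector might encode it (invented names, not the PR's actual types): SeqCst is simply treated as the strongest acquire/release combination.

#[allow(dead_code)]
#[derive(Clone, Copy)]
enum AtomicOrd { Relaxed, Acquire, Release, AcqRel, SeqCst }

// Does a load with this ordering perform an acquire operation?
fn load_acquires(ord: AtomicOrd) -> bool {
    matches!(ord, AtomicOrd::Acquire | AtomicOrd::AcqRel | AtomicOrd::SeqCst)
}

// Does a store with this ordering perform a release operation?
fn store_releases(ord: AtomicOrd) -> bool {
    matches!(ord, AtomicOrd::Release | AtomicOrd::AcqRel | AtomicOrd::SeqCst)
}

fn main() {
    // A SeqCst RMW acts as acquire + release; nothing more is modelled,
    // because the scheduler already yields one total order.
    assert!(load_acquires(AtomicOrd::SeqCst));
    assert!(store_releases(AtomicOrd::SeqCst));
    assert!(!load_acquires(AtomicOrd::Relaxed));
}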

@JCTyblaidd (Contributor, Author) commented

And regarding vector index re-use: I tidied up the explanation in the comments. The high-level view is that a vector index can only be re-used once the thread has terminated, been joined, and the effects of the join happen-before the current vector clocks of all currently active threads.

Therefore no data-races can possibly be reported involving the old thread: since the thread-local happens-before clocks never decrease, any stored timestamps from that index must be less than all active vector clocks, which means no data-race can ever be reported at that index, so the vector index stops providing any value. The program then advances the vector index by 1 and allows it to be re-used for a newly created thread. (A sketch of the reuse check follows.)
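
A minimal sketch of that reuse check (illustrative names; clocks reduced to plain vectors): an index may be recycled only when its thread has terminated, been joined, and the join's clock is dominated by every active thread's clock.

type VClock = Vec<u32>;

// Component-wise <=, i.e. `a` happens-before (or equals) `b`.
fn le(a: &VClock, b: &VClock) -> bool {
    a.iter().zip(b).all(|(x, y)| x <= y)
}

fn can_reuse(terminated: bool, joined: bool, join_clock: &VClock, active: &[VClock]) -> bool {
    terminated && joined && active.iter().all(|clock| le(join_clock, clock))
}

fn main() {
    let join_clock = vec![2, 1];
    let active = [vec![3, 1], vec![2, 2]];
    // Every live thread has already seen the join, so no timestamp stored
    // at this index can ever participate in a reported race again.
    assert!(can_reuse(true, true, &join_clock, &active));
}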

@RalfJung (Member) commented
(I only had time to go over the first of your new commits today, will continue tomorrow. Thanks for your patience!)

@RalfJung (Member) commented

> But the high-level view is that a vector index can only be re-used once the thread has terminated, been joined, and the effects of the join happen-before the current vector clocks of all currently active threads.

Ah, with this extra last condition about all threads, that makes sense -- thanks!

@RalfJung (Member) left a comment

Thanks a lot! This is my final round of comments/questions.

Sadly, I do not have the time to actually understand the underlying data race detection algorithm. But thankfully you based this on a POPL paper, so I am confident that, if necessary, I can dig into that paper to understand the grand scheme of things. If there are ways in which you diverged from the paper, or particular implementation choices you made (beyond what you already explained in comments), I'd much appreciate it if you could add those to the comment at the top of the data race detector, so that whoever tries to match up paper and implementation in the future will have it as easy as you can make it.

I also like your compile-fail tests. Are there others you could imagine having that you didn't have the time to write? If so, please open an issue, so maybe someone can pick this up in the future. One thing I wondered about is a test related to release sequences -- something that is UB because the release sequence got broken.

@JCTyblaidd (Contributor, Author) commented

Fixed the review changes. I added some comments to validate_lock_release(_shared); incrementing the clocks twice per operation seems to be unnecessary and I think was a hold-over from an earlier version, so I changed it to a single increment. The mutex/rwlock/condvar get_set_id accesses have been made relaxed; they were converted to atomic since the current internal implementation reports a data-race when non-atomic accesses are used (for mutex it is in pthread_mutex_lock, I think from mutex_get_or_create_id). It might be a bug in the pthreads shim implementation, I am not sure either way; if it is a bug then the accesses should be converted back to non-atomic once fixed.

@RalfJung (Member) commented

I am not surprised to see atomic accesses in the mutex implementation, this makes sense -- they just should not induce any synchronization themselves. So Relaxed is just right, thanks!

src/data_race.rs Outdated
@@ -37,6 +37,24 @@
//! to a acquire load and a release store given the global sequentially consistent order
//! of the schedule.
//!
//! The timestamps used in the data-race detector assign each sequence of non-atomic operations
//! followed by a single atomic or concurrent operation a single timestamp.
//! Write, Read, Write, ThreadJoin will be represented by a single timestamp value on a thread

Is there a full stop missing at the end here?

Also, does this match what the paper does? If the paper increments before and after, then adding a comment that this is somewhat strange but see the paper for details seems better than trying to adjust the algorithm the paper authors came up with.
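
A tiny worked illustration of the "one timestamp per sequence" reading (an assumption about the intended semantics, matching the quoted doc comment rather than the paper): the thread's clock component only ticks after a release-style operation, so the preceding run of non-atomic accesses shares a single timestamp.

fn main() {
    let mut timestamp = 0u32;
    let events = ["write x", "read y", "write z", "thread join (release)"];
    for ev in events {
        println!("t = {timestamp}: {ev}"); // Write, Read, Write, Join all at t = 0
        if ev.ends_with("(release)") {
            timestamp += 1; // single increment, after the release operation
        }
    }
    assert_eq!(timestamp, 1);
}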

src/data_race.rs Outdated
@@ -499,7 +517,7 @@ pub trait EvalContextExt<'mir, 'tcx: 'mir>: MiriEvalContextExt<'mir, 'tcx> {
}

/// Perform an atomic compare and exchange at a given memory location

Suggested change
/// Perform an atomic compare and exchange at a given memory location
/// Perform an atomic compare and exchange at a given memory location.

src/data_race.rs Outdated
Comment on lines 1338 to 1339
/// For normal locks this should be equivalent to `validate_lock_release`
/// this function only exists for joining over the set of concurrent readers

Suggested change
/// For normal locks this should be equivalent to `validate_lock_release`
/// this function only exists for joining over the set of concurrent readers
/// For normal locks this should be equivalent to `validate_lock_release`.
/// This function only exists for joining over the set of concurrent readers

@@ -62,9 +62,11 @@ fn mutex_get_kind<'mir, 'tcx: 'mir>(
mutex_op: OpTy<'tcx, Tag>,
) -> InterpResult<'tcx, ScalarMaybeUninit<Tag>> {
let offset = if ecx.pointer_size().bytes() == 8 { 16 } else { 12 };
//FIXME: this has been made atomic to fix data-race reporting inside the internal
// mutex implementation, it may not need to be atomic.

No need to leave a FIXME here I think (same in the rest of this file).

src/data_race.rs Outdated
Comment on lines 654 to 656
/// Range of Vector clocks, this gives each byte a potentially
/// unqiue set of vector clocks, but merges identical information
/// together for improved efficiency.

Suggested change
/// Range of Vector clocks, this gives each byte a potentially
/// unqiue set of vector clocks, but merges identical information
/// together for improved efficiency.
/// Assigning each byte a MemoryCellClocks.

"Set" is a bit confusing here since it sounds like it might be a HashSet or so. Also, the fact that RangeMap is internally more efficient should not matter at this level.

Commit message (truncated): …ing into it again, it seems the paper only increments the timestamp after release operations, so changed to an approximation of that implementation.
@RalfJung (Member) commented

Thanks again @JCTyblaidd for the PR and for working with me through the review. This is an amazing addition to Miri. :-)

@bors r+

@bors (Collaborator) commented Nov 29, 2020

📌 Commit cbb695f has been approved by RalfJung

@bors (Collaborator) commented Nov 29, 2020

⌛ Testing commit cbb695f with merge d473242...

@bors (Collaborator) commented Nov 29, 2020

☀️ Test successful - checks-actions
Approved by: RalfJung
Pushing d473242 to master...

@bors bors merged commit d473242 into rust-lang:master Nov 29, 2020
m-ou-se added a commit to m-ou-se/rust that referenced this pull request Dec 1, 2020

update Miri

This update includes rust-lang/miri#1617, the data race detector by @JCTyblaidd. :)

Cc @rust-lang/miri r? @ghost