
rollback to a consistent state after fsync error #131

Merged · 31 commits · Nov 11, 2021

Conversation

@MrCroxx (Member) commented Oct 27, 2021

Make PipeLog try to roll back to a consistent state after an fsync error.

According to PostgreSQL's fsync() surprise, fsync is not retriable. We must make the pipe log roll back to where it last synced.

What's changed?

invariant: non-active log files are always synced and safe.

Updated:

After discussion, the final decisions of this PR are listed below:

  1. Panic when there is any fsync error.
  2. Truncate the active log file when there is a write error during buffered writes to the log file.
  3. Panic when there is any error during log file rotation (truncate/sync/create).

Introduce pre_write and post_write for PipeLog to acquire or release a WriteContext, which can be used to track the sync state.

If fsync raises an error, the log will roll back to the last synced state and set errors for the writers in the current write group. But if any rolled-back bytes belong to writers in previous write groups, RaftEngine will panic, because that means data the user believes is safe has actually been lost.

new_log_file is divided into two phases: first truncate and sync the old log file, then rotate to a new log file. The second phase is safe to retry if sync_dir fails, because the new log will overwrite the failed one.
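
To make the new contract concrete, here is a minimal sketch of what the pre_write/post_write flow could look like. Only the names PipeLog, pre_write, post_write and WriteContext come from this PR; the fields, trait name and error type below are assumptions for illustration, not the actual implementation.

```rust
use std::io::Result;

// Hypothetical shape of the write context; field names are illustrative only.
pub struct WriteContext {
    /// Offset known to be durable before this write group started.
    pub last_synced: u64,
    /// Offset reached by this write group's buffered writes.
    pub written: u64,
}

// Hypothetical extension trait standing in for the methods added to PipeLog.
pub trait PipeLogWriteExt {
    /// Snapshot the sync state before the leader of a write group starts
    /// appending log batches.
    fn pre_write(&mut self) -> WriteContext;

    /// Sync (and possibly rotate) after the write group finishes. Per the
    /// decisions above: a buffered-write error truncates the active file back
    /// to `ctx.last_synced`; an fsync or rotation error panics.
    fn post_write(&mut self, ctx: WriteContext, sync: bool) -> Result<()>;
}
```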

TODO:

  • Unit tests with failpoints.

codecov bot commented Oct 28, 2021

Codecov Report

Merging #131 (8588698) into master (9f0c313) will increase coverage by 0.33%.
The diff coverage is 98.26%.


@@            Coverage Diff             @@
##           master     #131      +/-   ##
==========================================
+ Coverage   96.69%   97.02%   +0.33%     
==========================================
  Files          22       23       +1     
  Lines        5816     6194     +378     
==========================================
+ Hits         5624     6010     +386     
+ Misses        192      184       -8     
| Impacted Files | Coverage Δ |
| --- | --- |
| src/errors.rs | 100.00% <ø> (ø) |
| src/pipe_log.rs | 93.10% <ø> (ø) |
| tests/failpoints/mod.rs | 100.00% <ø> (ø) |
| src/engine.rs | 96.87% <90.00%> (+0.61%) ⬆️ |
| tests/failpoints/test_io_error.rs | 98.13% <98.13%> (ø) |
| src/file_pipe_log.rs | 95.69% <100.00%> (+0.33%) ⬆️ |
| src/log_file.rs | 92.15% <100.00%> (+3.92%) ⬆️ |
| src/purge.rs | 94.78% <100.00%> (+0.27%) ⬆️ |
| tests/failpoints/test_engine.rs | 100.00% <100.00%> (ø) |
| ... and 3 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9f0c313...8588698.

@tabokie (Member) left a comment:

On second thought, I think it's risky to silently roll back on a buffered write failure. Consider this case:

  1. write (r=1, index=1)
  2. write (r=1, index=2,3,4) fails, but index 2-3 is written to disk
  3. write (r=2, index=1) succeeds; this write overwrites (r=1, index=2), but (r=1, index=3) remains
  4. engine restarts, the data is inconsistent

This is unsolvable unless we tolerate logical errors when recovering tail records, but that's difficult to reason about.

I think it would be better to physically truncate the file when a buffered write fails. That truncate call could fail too, so we must keep track of outstanding dirty writes in a counter (e.g. written_dirty). The next buffered write must attempt the truncate again.

A rough sketch of my thoughts:

buffered_write:
  if written_dirty > written {
    truncate(written)?;
  }
  let r = pwrite(...);
  if r.is_err()  {
    if truncate(written).is_err() {
      written_dirty += len;
    }
  } else {
    written += len;
  }
  return r;

post_write:
  if let Err(e) = fsync() {
    // panic if needed
    if truncate(last_sync).is_err() {
      written_dirty = written;
    }
    written = last_sync;
    return e;
  }
  if needs_rotate {
    // Errors from these calls don't bubble to user.
    if ftruncate(old_file).is_ok() && open(new_file).is_ok() && fsync_dir().is_ok() {
      if let Err(e) = pwrite(header) {
        written_dirty = header.len
      } else {
        written = header.len
      }
    }
  }

    return Err(Error::Fsync(ctx.synced, e.to_string()));
}
if ctx.syncable == Syncable::NeedRotate {
    self.mut_queue(queue).rotate()?;
Member:
When last_sync == 0 and written == LOG_FILE_HEADER_LEN, we should tolerate the fsync error.

Member Author:
The implementation splits rotation into two parts: first call truncate_and_sync on the current log file, then call rotate to create the new log file and move the writer to it. So reaching rotate here means the old log file has already been truncated and synced.
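
A minimal sketch of that two-phase split, assuming hypothetical plumbing (only the phase ordering and the retry property are taken from the comment above):

```rust
use std::io::Result;

// Illustrative only, not the PR's code. The two closures stand in for the
// real `truncate_and_sync` and `rotate` steps on the active queue.
fn new_log_file(
    truncate_and_sync_old: impl FnOnce() -> Result<()>,
    rotate_to_new: impl FnOnce() -> Result<()>,
) -> Result<()> {
    // Phase 1: truncate the old active file to its written length and fsync
    // it, so the "non-active log files are always synced and safe" invariant
    // holds for it from now on.
    truncate_and_sync_old()?;
    // Phase 2: create the new file, write its header, and move the writer.
    // This phase is safe to retry if sync_dir fails, because a retry simply
    // overwrites the half-created file.
    rotate_to_new()
}
```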

MrCroxx (Member Author) commented Oct 28, 2021

> On second thought, I think it's risky to silently roll back on a buffered write failure. Consider this case:
>
>   1. write (r=1, index=1)
>   2. write (r=1, index=2,3,4) fails, but index 2-3 is written to disk
>   3. write (r=2, index=1) succeeds; this write overwrites (r=1, index=2), but (r=1, index=3) remains
>   4. engine restarts, the data is inconsistent

@tabokie I agree with you on this case.

> I think it would be better to physically truncate the file when a buffered write fails. That truncate call could fail too, so we must keep track of outstanding dirty writes in a counter (e.g. written_dirty). The next buffered write must attempt the truncate again.

IMO, this method has the same outcome if the engine restarts at the moment the fsync error happens. If we only keep written_dirty in memory, when RaftEngine restarts it still won't know whether the tail has been successfully truncated, so that case cannot be solved. Maybe it is acceptable for RaftEngine to panic if the truncate after an fsync error fails, since that would mean the disk is highly unreliable.

tabokie (Member) commented Oct 28, 2021

It's not needed; written_dirty is only there to block incoming writes (actually a bool variable would suffice). If a dirty write isn't overwritten by other writes, then the disk state is still consistent.

And I made a mistake in that case: ideally a log batch can only be written to disk atomically, so step 2 is unlikely to happen. So I guess my proposal can be reduced to only physically truncating the file on a "write group" fsync failure, not on a buffered write failure.
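
A minimal sketch of the flag idea (all names here are hypothetical, purely to illustrate the point about blocking incoming writes until the truncate succeeds):

```rust
use std::io::Result;

// Hypothetical state for the active log file; not the PR's actual types.
struct ActiveFileState {
    written: u64, // length of data known to be good
    dirty: bool,  // a failed write may have left bytes past `written` on disk
}

impl ActiveFileState {
    /// Called before any new buffered write. While `dirty` is set, no new
    /// bytes are allowed in until the tail has been truncated back, so a
    /// dirty tail can never be partially overwritten.
    fn before_write(&mut self, truncate: impl FnOnce(u64) -> Result<()>) -> Result<()> {
        if self.dirty {
            truncate(self.written)?;
            self.dirty = false;
        }
        Ok(())
    }
}
```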

@tabokie (Member) left a comment:

Needs unit tests. I've added some IO failpoints in #139, every one of them needs to be tested. You can use catch_unwind_silent to test the panics.
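
For reference, here is a guess at how such a panic test is usually structured. The helper below is an illustrative implementation of a catch_unwind_silent-style utility, not necessarily the one in this repository, and the closure body merely stands in for a write that hits an injected fsync failpoint.

```rust
use std::panic::{self, AssertUnwindSafe};

// Run a closure, catch its panic, and silence the default panic hook so the
// expected panic does not spam the test output. Illustrative implementation.
fn catch_unwind_silent<F: FnOnce() -> R, R>(f: F) -> std::thread::Result<R> {
    let prev_hook = panic::take_hook();
    panic::set_hook(Box::new(|_| {}));
    let result = panic::catch_unwind(AssertUnwindSafe(f));
    panic::set_hook(prev_hook);
    result
}

fn main() {
    // Stand-in for writing through an engine with an fsync failpoint enabled;
    // the real test would set the failpoint with fail::cfg(...) first.
    let result = catch_unwind_silent(|| panic!("cannot sync log file"));
    assert!(result.is_err());
}
```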

    .append(LogQueue::Rewrite, log_batch.encoded_bytes(), sync)?;
let file_handle = self
    .pipe_log
    .append(LogQueue::Rewrite, log_batch.encoded_bytes())?;
Member:
You are not calling truncate after this failure.

Panics directly in the following cases:

- Truncate failed after buffer write error.
- Sync or rotate failed.
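
A minimal illustration of that policy under assumed names (not the PR's code): if the truncate that follows a failed buffered write also fails, the process aborts instead of continuing with an unknown file tail.

```rust
use std::io::Result;

// Hypothetical helper illustrating the panic-on-failed-truncate policy.
fn recover_from_buffered_write_error(
    truncate_to: u64,
    truncate: impl FnOnce(u64) -> Result<()>,
) {
    if let Err(e) = truncate(truncate_to) {
        // The tail of the active file is now in an unknown state, so the only
        // safe reaction is to abort and recover from disk on restart.
        panic!("cannot truncate log file back to offset {}: {}", truncate_to, e);
    }
}
```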

Cargo.toml
@@ -58,3 +58,9 @@ members = [ "stress" ]
name = "bench_recovery"
harness = false
required-features = ["failpoints"]

[[test]]
Member:
You should send another PR to move all existing failpoint tests to this folder.

Member Author:
OK. I'd like to move the IO error test to this folder first in this PR, then send another one to move the existing tests.

Member:
Then add a mod.rs for the failpoints folder; listing individual tests in the main Cargo.toml doesn't look good.
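
A sketch of the suggested layout (module names are taken from the files that appear in the coverage report; treat the exact contents as an assumption): a single tests/failpoints/mod.rs pulls each test file in as a module, so Cargo.toml only needs one [[test]] entry for the whole folder.

```rust
// tests/failpoints/mod.rs -- hypothetical sketch of the suggested layout.
// One [[test]] target in Cargo.toml points at this file; every test file in
// the folder is declared here as a module instead of as its own target.
#![cfg(feature = "failpoints")]

mod test_engine;
mod test_io_error;
```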

@@ -160,9 +161,6 @@ impl<W: Seek + Write> ActiveFile<W> {
}

fn rotate(&mut self, fd: Arc<LogFd>, writer: W) -> Result<()> {
Member:
maybe rename it to reset


// b0 (ctx); b1 success; b2 fail, truncate; b3 success
let mut hook = FailpointsHook::new();
hook.register_pre_append_action(2, || {
Member:
I don't particularly like this approach. Could you try using cfg_callback? It might be easier.

Also, please fix CI.

@tabokie (Member) left a comment:

Tests are failing.

let timer = AtomicU64::new(0);
fail::cfg_callback("engine::write::pre", move || {
    match timer.fetch_add(1, std::sync::atomic::Ordering::SeqCst) {
        2 => fail::cfg("log_fd::write::post_err", "return").unwrap(),
Member:
What's the difference between post_err and err? If you want to test that the file is truncated, an engine restart is needed.

Member Author:
This is for checking whether the following appends correctly overwrite the previous ones.

MrCroxx (Member Author) commented Nov 4, 2021

> Tests are failing.

Yep. It works fine in my environment; I'm looking into the reason.

@MrCroxx force-pushed the extract-fsync-handle branch 4 times, most recently from 09ed081 to 790bdce on November 5, 2021 at 13:16
tabokie and others added 2 commits November 6, 2021 16:11

if manager.active_file.written >= manager.rotate_size {
    if let Err(e) = manager.rotate() {
        panic!(
Member:
It says this panic is not covered.

Member Author:
I ran the coverage test locally and it says these lines are covered; I also checked manually and confirmed they are covered.

[screenshot: local coverage report]

I've also checked the CI script, which seems fine. I'm not sure if it's a bug in the CI coverage test.

Member Author:
Just found where the problem lies; fixing.

fail::cfg("active_file::truncate::force", "return").unwrap();
fail::cfg("log_fd::truncate::err", "return").unwrap();
assert!(catch_unwind_silent(|| {
write_tmp_engine(ReadableSize::kb(1024), ReadableSize::kb(4), 1024, 1, 4)
Member:
write_tmp_engine seems like overkill.

MrCroxx (Member Author) commented Nov 11, 2021

stress --regions 1000 --write-sync true --time 600

[current]
Throughput(QPS) = 10119.08
Latency(μs) min = 40, avg = 83.03, p50 = 61, p90 = 161, p95 = 187, p99 = 230, p99.9 = 288, max = 24607
Fairness = 100.0%
Write Bandwidth = 7.5MiB/s

[master]
Throughput(QPS) = 10249.05
Latency(μs) min = 42, avg = 80.07, p50 = 59, p90 = 155, p95 = 174, p99 = 214, p99.9 = 265, max = 16639
Fairness = 100.0%
Write Bandwidth = 7.7MiB/s
