[CI] CorruptedFileIT.testCorruptFileThenSnapshotAndRestore failure #30577

Closed
romseygeek opened this issue May 14, 2018 · 10 comments · Fixed by #30778
Assignees: tlrx, dnhatn
Labels: :Distributed/Snapshot/Restore (anything directly related to the `_snapshot/*` APIs), >test-failure (triaged test failures from CI), v7.0.0-beta1

Comments

romseygeek (Contributor) commented May 14, 2018

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=sles/2435/console

After a file is corrupted, the snapshot should throw an error, but it seems to succeed instead. It doesn't reproduce locally, though.

consoleText.txt

romseygeek added the :Distributed/Snapshot/Restore, >test-failure, and v7.0.0 labels on May 14, 2018
elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed

tlrx (Member) commented May 17, 2018

This test creates an index with 0 replicas, merges disabled, and a very high value for the translog flush threshold size setting. Then it corrupts a file in a random primary shard (in this failure, it is _2.si), registers a repository, and creates a snapshot.

When creating the snapshot, the test expects the shard snapshot process to load the store metadata and then fail because one of the Lucene files is corrupted, which fails the shard snapshot and marks the snapshot as PARTIAL.
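As a rough illustration (not Elasticsearch's actual Store code), loading the store metadata boils down to reading every file referenced by the snapshotted commit and verifying its footer checksum, so a single flipped bit should surface as a CorruptIndexException. A minimal Lucene-level sketch, assuming direct access to the shard's index directory:

```java
import java.nio.file.Paths;
import java.util.List;

import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

public class VerifyLatestCommit {
    public static void main(String[] args) throws Exception {
        // args[0]: path to the shard's Lucene index directory (hypothetical)
        try (Directory dir = FSDirectory.open(Paths.get(args[0]))) {
            List<IndexCommit> commits = DirectoryReader.listCommits(dir); // oldest to newest
            IndexCommit latest = commits.get(commits.size() - 1);
            for (String file : latest.getFileNames()) {
                try (IndexInput in = dir.openInput(file, IOContext.READONCE)) {
                    // Recomputes the checksum over the whole file and compares it to the
                    // footer; a corrupted file throws CorruptIndexException here, which is
                    // what should fail the shard snapshot and mark the snapshot PARTIAL.
                    CodecUtil.checksumEntireFile(in);
                }
            }
            System.out.println("all files of " + latest.getSegmentsFileName() + " verified");
        }
    }
}
```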

In this test failure the snapshot completed as SUCCESS. I looked closely at it and I can't figure out why it didn't fail when loading the store metadata. Of course it does not reproduce locally. The test correctly corrupted the file:

[INFO ][test                     ] Corrupting file --  flipping at position 68 from 0 to 1 file: _2.si
[INFO ][test                     ] Checksum before: [2483589937] after: [1016601414] checksum value after corruption: 2483589937] file: _2.si length: 387
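For reference, the corruption step amounts to flipping a single bit at a given position in the file, which invalidates the footer checksum Lucene stores at the end of every file. A minimal sketch of that idea (not the actual test helper; the file path and position are placeholders):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FlipOneBit {
    // Flip the lowest bit of the byte at the given position, e.g. 0 -> 1.
    static void flip(Path file, int position) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        bytes[position] ^= 1;
        Files.write(file, bytes); // overwrite with the corrupted contents
    }

    public static void main(String[] args) throws IOException {
        flip(Paths.get(args[0]), Integer.parseInt(args[1])); // e.g. _2.si and 68
    }
}
```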

I suspect a test bug, or maybe that the file was not part of the snapshot but it should have been. Merges are disabled and flushes are manually executed before corrupting the file.

I pushed 7915b5f to add more debug information; hopefully this error will appear again and we'll be able to grab the shard files and the snapshotted files.

tlrx self-assigned this on May 17, 2018
tlrx added a commit that referenced this issue May 17, 2018
This test failed but the cause is not obvious. This commit adds more
debug logging traces so that if it reproduces we could gather more
information.

Related #30577
ywelsch (Contributor) commented May 17, 2018

@dnhatn With the changes that were made to CombinedDeletionPolicy, do you perhaps know if we can expect all unreferenced Lucene files to be cleaned up after a flush?

dnhatn (Member) commented May 18, 2018

@ywelsch I have looked at the test.
If the latest value of the global checkpoint has not been fsynced before a flush, we will have two commits; then if we corrupt files that are referenced only by the previous commit, the snapshot of the latest commit will be SUCCESS instead of PARTIAL.
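A minimal Lucene-level sketch of that scenario (assuming direct access to the shard's index directory), listing the files that only the older commit references and that a snapshot of the latest commit would therefore never read:

```java
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class OlderCommitOnlyFiles {
    public static void main(String[] args) throws Exception {
        // args[0]: path to the shard's Lucene index directory (hypothetical)
        try (Directory dir = FSDirectory.open(Paths.get(args[0]))) {
            List<IndexCommit> commits = DirectoryReader.listCommits(dir); // oldest to newest
            IndexCommit oldest = commits.get(0);
            IndexCommit latest = commits.get(commits.size() - 1);
            Set<String> onlyInOldest = new HashSet<>(oldest.getFileNames());
            onlyInOldest.removeAll(latest.getFileNames());
            // Corrupting any of these files cannot fail a snapshot of the latest commit,
            // because the shard snapshot only checksums the files that commit references.
            System.out.println("referenced only by the older commit: " + onlyInOldest);
        }
    }
}
```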

I will follow up on this test with @tlrx.

ywelsch (Contributor) commented May 18, 2018

awesome!

tlrx (Member) commented May 18, 2018

Good catch @ywelsch and @dnhatn. But what could make a file referenced by one commit point and not by the next, when merges are disabled and the shard is manually flushed?

dnhatn (Member) commented May 18, 2018

@tlrx You're right. My assumption is not correct in this case.

dnhatn added a commit that referenced this issue May 19, 2018
dnhatn added a commit that referenced this issue May 19, 2018
dnhatn (Member) commented May 19, 2018

@tlrx I will be taking care of this.

dnhatn self-assigned this on May 19, 2018
dnhatn (Member) commented May 20, 2018

@tlrx and @ywelsch I've reproduced this and have an explanation. It is possibly caused by LUCENE-8253.

  1. indexRandom(true, builders) not only adds docs but also indexes and deletes bogus docs.
[test][4] Index [fEPBemMBIyPePwGlvrH9]
 [test][4] Index [bogus_doc_ीनठफड़जऊa44]
[test][4] Index [oUPBemMBIyPePwGlvrH-]
[test][4] Delete [bogus_doc_ीनठफड़जऊa44]
  2. With a special seed (A9D224BA3576B58B:E8899F47DB14C3CA), indexRandom might create a fully deleted segment in which all documents are deleted.
Segment _2(7.4.0):c1, files [_2.si, _2.cfe, _2.cfs], maxDocs=[1], numDocs=[0], delDocs=[1], seq#=[6] 

Previously, a fully deleted segment was kept around until the next commit; since LUCENE-8253 it is dropped immediately. The problem is that its files, which are not referenced by any commit point, are not released but are retained in IndexFileDeleter's lastFiles.

IFD's last files:  [_0.si, _3.si, _1.cfe, _1.si, _3_1.liv, _2.si, 
_0.cfs, _2.cfs, _3.cfe, _1.cfs, _0.cfe, _2.cfe, _3.cfs, _2_1.liv]
  3. We then corrupt _2.cfe, which belongs to the dropped segment.
Corrupting file --  flipping at position 197 from 0 to 1 file: _2.cfe
  4. Taking the snapshot then succeeds because the last commit does not include the corrupted file (see the sketch after this list).
Loading store metadata using SnapshotIndexCommit{segments_3},
files: [_1.cfs, _0.cfe, _0.si, _3.si, _1.cfe, _1.si, _3_1.liv, _0.cfs, _3.cfs, _3.cfe, segments_3]
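A minimal sketch of the check behind step 4 (assuming direct access to the shard's index directory): the shard snapshot only reads the files named by the commit it snapshots, so the leftover _2.cfe from the dropped segment is never checksummed and the corruption goes unnoticed.

```java
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LeftoverFileNotInCommit {
    public static void main(String[] args) throws Exception {
        // args[0]: path to the shard's Lucene index directory (hypothetical)
        try (Directory dir = FSDirectory.open(Paths.get(args[0]))) {
            List<IndexCommit> commits = DirectoryReader.listCommits(dir);
            IndexCommit latest = commits.get(commits.size() - 1); // e.g. segments_3
            Collection<String> snapshotted = latest.getFileNames();
            System.out.println("on disk:     " + Arrays.toString(dir.listAll()));
            System.out.println("snapshotted: " + snapshotted);
            // Expected to print false for the leftover file of the dropped segment:
            // the snapshot never opens it, so its corruption cannot fail the snapshot.
            System.out.println("commit references _2.cfe? " + snapshotted.contains("_2.cfe"));
        }
    }
}
```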

I opened LUCENE-8324. If @s1monw agrees to fix this in Lucene, we are good; otherwise we need to update the test.

tlrx (Member) commented May 22, 2018

Thanks @dnhatn. I suspected something like that but I did not look deeply enough.

dnhatn added a commit that referenced this issue May 22, 2018
The new snapshot includes LUCENE-8324, which fixes a missing checkpoint
after a fully deleted segment is dropped on flush. This snapshot should
resolve the failed tests in the CorruptedFileIT suite.

Closes #30741
Closes #30577
ywelsch pushed a commit to ywelsch/elasticsearch that referenced this issue May 23, 2018
This test failed but the cause is not obvious. This commit adds more
debug logging traces so that if it reproduces we could gather more
information.

Related elastic#30577
dnhatn added a commit that referenced this issue May 29, 2018
The new snapshot includes LUCENE-8324, which fixes a missing checkpoint
after a fully deleted segment is dropped on flush. This snapshot should
resolve the failed tests in the CorruptedFileIT suite.

Closes #30741
Closes #30577