
Corrupt index after disk full, unclean shutdown / Possible to avoid translog recovery at all? #18972

Closed
SergiuCip opened this issue Jun 20, 2016 · 44 comments
Labels: :Distributed/Recovery, feedback_needed


@SergiuCip

SergiuCip commented Jun 20, 2016

Elasticsearch version: 2.3.1. After trying the steps below we decided to upgrade to the newest version, in case the bug had already been fixed there.

JVM version: openjdk version "1.8.0_91"
OpenJDK Runtime Environment (build 1.8.0_91-b14)
OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)

OS version: CentOS 7

Description of the problem including expected versus actual behavior:

The disk space on our Elasticsearch server got full and we had to remove some logs in order to reduce the disk space usage. We then restarted Elasticsearch and immediately saw that some of our indices were broken (see the logs attached below).

We moved the entire Elasticsearch 'data' directory out of the way, as it was urgent to keep collecting logs, and restarted Elasticsearch, which started with a fresh 'data' directory.

In order to restore the indices we copied some of them to another Elasticsearch server, because we did not want to do this on our live Elasticsearch server, and proceeded as follows:

  1. Stop the ES server
  2. Copy + change permission of the index
  3. Restart ES server

We also tried the following (the error each attempt produced is shown after it):

  • removed all the translog files

Error message:

[2016-06-20 09:01:56,764][WARN ][cluster.action.shard ] [Famine] [logstash-2016.05.11][3] received shard failed for target shard [[logstash-2016.05.11][3], node[d-z_xFoqTrGd-kip-Tly_w], [P], v[3], s[INITIALIZING], a[id=yh9ngGZNQ5CIrq87hxZRTg], unassigned_info[[reason=ALLOCATION_FAILED], at[2016-06-20T09:01:56.636Z], details[failed recovery, failure IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: EOFException; ]]], indexUUID [xN9jy5ZiRqaALYvkDxdFgA], message [failed recovery], failure [IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: EOFException; ]

Jun 20 09:01:49 log-01.novalocal elasticsearch[9894]: Caused by: [logstash-2016.05.11][[logstash-2016.05.11][0]] EngineCreationFailureException[fa
Jun 20 09:01:49 log-01.novalocal elasticsearch[9894]: at org.elasticsearch.index.engine.InternalEngine.(InternalEngine.java:155)
Jun 20 09:01:49 log-01.novalocal elasticsearch[9894]: at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFac
Jun 20 09:01:49 log-01.novalocal elasticsearch[9894]: at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1509)
Jun 20 09:01:49 log-01.novalocal elasticsearch[9894]: at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1493)
Jun 20 09:01:49 log-01.novalocal elasticsearch[9894]: at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:
Jun 20 09:01:49 log-01.novalocal elasticsearch[9894]: at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:938)

[2016-06-20 07:12:24,538][WARN ][cluster.action.shard ] [Brain-Child] [logstash-2016.05.11][3] received shard failed for [logstash-2016.05.11][3], node[FcS6VgojTuijenAeemDE6Q], [P], v[3], s[INITIALIZING], a[id=OdCh5n8GSDeZpiCLW-CKZg], unassigned_info[[reason=ALLOCATION_FAILED], at[2016-06-20T07:12:24.364Z], details[failed recovery, failure IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: EOFException; ]], indexUUID [xN9jy5ZiRqaALYvkDxdFgA], message [failed recovery], failure [IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: EOFException; ]
[logstash-2016.05.11][[logstash-2016.05.11][3]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: EOFException;

  • removed the translog.ckp

Caused by: [logstash-2016.05.11][[logstash-2016.05.11][4]] EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/var/lib/elasticsearch/elasticsearch/nodes/0/indices/logstash-2016.05.11/4/translog/translog.ckp];
at org.elasticsearch.index.engine.InternalEngine.(InternalEngine.java:155)
at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)
at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1509)
at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1493)
at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:966)
at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:938)
at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:241)

  • removed everything besides translog.ckp in the translog folder
  • we also used java -cp lucene-core5.4.1.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex to check our indices, and it reported that everything is OK; however, we still kept getting errors (see below for roughly how we invoked it).
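For reference, this is roughly how we invoked CheckIndex against one shard's Lucene index directory; the jar we used and the shard path are just from our environment and will differ on other setups:

java -cp lucene-core5.4.1.jar -ea:org.apache.lucene... \
     org.apache.lucene.index.CheckIndex \
     /var/lib/elasticsearch/elasticsearch/nodes/0/indices/logstash-2016.05.11/3/index

(We ran it with Elasticsearch stopped; without the -exorcise option CheckIndex only reads the index and reports, it does not modify anything.)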

Could someone please tell me what I have done wrong and what I could do to restore my indices?

Thanks,
Sergiu

@clintongormley

@SergiuCip When you say:

and we had to remove some logs in order to reduce the disk space usage.

do you mean transaction logs or files in the logs/ directory?

@uplexPaul

Hi. I am working with @SergiuCip on that issue

@clintongormley:
He doesn't mean that we removed translog files, so please disregard that sentence ;)
It does seem, however, that we somehow lost the translog files while moving indices around.

In any case, the main INDEX itself is intact (checked with Lucene). So even though we no longer have translog files for these indices, they SHOULD be able to recover without them as long as the index is not corrupted. That is what we would like to achieve.

@SergiuCip
Author

@clintongormley: I am sorry for being ambiguous :) What I meant was that we deleted some logs from /var/log/elasticsearch because they were growing at a fast rate due to the recovery process. As @uplexPaul mentioned, we do have the translog.ckp file but not the translog files themselves.

When you try to start Elasticsearch with those indices, it complains that it cannot start, for the reasons I posted in my first comment.

Any help is appreciated.

Thanks

@clintongormley

Sorry to ask, but are you sure you're on 2.3.3? On all nodes? We had a bug in early versions of 2.x which could leave corrupt transaction logs behind after a disk-full event, but I haven't heard reports of this since 2.3.

@bleskes
Contributor

bleskes commented Jun 20, 2016

and one more clarification - you say you do have the translog.ckp file, but the exception indicates otherwise:

NoSuchFileException[/var/lib/elasticsearch/elasticsearch/nodes/0/indices/logstash-2016.05.11/4/translog/translog.ckp];

@SergiuCip
Author

@clintongormley Hi, and thanks for the update. I beg your pardon: we were running ES version 2.3.1 when the issue occurred and then updated to 2.3.3.

@bleskes: We have tried everything: keeping the .ckp file and starting ES (no luck), and also deleting the .ckp file and starting ES (no luck either).

@s1monw
Contributor

s1monw commented Jun 20, 2016

@SergiuCip I can see some EOF exceptions, but I am missing the part that says which file hits EOF. Is it possible to get the full exceptions? Do you also have the log from when the disk went full? It would be great to get all of the logs and exception traces.

@SergiuCip
Author

@s1monw I will try to find the place in the log where it says EOF. However, as for the log from when the disk started to fill up, I am afraid I no longer have it, as that was the first log we started truncating because of its fast growth.

I will do my best to provide you with as much logging as possible ;)

@SergiuCip
Author

Good morning everyone!

Thank you for all the support along the way! As promised I have run the test again in order to have a detailed log and here it is: https://gist.github.com/SergiuCip/86dc26c3f67a526d7e4abc060c327eb4

Also please find attached the .ckp file for this log
translog.ckp.zip

@SergiuCip
Author

SergiuCip commented Jun 21, 2016

The EOF error occurs when I manually touch the translog-21.tlog file:

https://gist.github.com/SergiuCip/984f26e1ac6cae562b35a48be7a27164

@s1monw
Contributor

s1monw commented Jun 21, 2016

@SergiuCip thanks for all the files and info, let me ask some more questions:

  • Do you still see a translog-20.ckp file sitting around and is there a translog-20.tlog file as well?
  • when you hit the disk-full outage, which exact ES version were you running? This looks pretty much like Never delete translog-N.tlog file when creation fails #15788
  • when you say "The EOF error occurs when I manually touch the translog-21.tlog file", does this mean the translog-21.tlog file didn't exist and you added it?
  • the original exception was a FileNotFound, is this correct?

thanks!

@SergiuCip
Author

SergiuCip commented Jun 21, 2016

@s1monw You're welcome and thank you for helping me debug it.

  1. No, I cannot see the translog-20.ckp file anywhere... I had to create it manually but again it complained.
  2. When the issue happened I was running version 3.2.1
  3. Yes, that is correct. When you add the broken index, these are the first lines of the log:

[logstash-2016.05.12][[logstash-2016.05.12][3]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nes
ted: NoSuchFileException[/home/scorn/test_es/data/elasticsearch/nodes/0/indices/logstash-2016.05.12/3/translog/translog-21.tlog];
at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:250)
at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:56)
at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:129)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: [logstash-2016.05.12][[logstash-2016.05.12][3]] EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/home/scorn/test_es/data/elasticse
arch/nodes/0/indices/logstash-2016.05.12/3/translog/translog-21.tlog];
at org.elasticsearch.index.engine.InternalEngine.(InternalEngine.java:155)
at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)
at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1509)
at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1493)
at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:966)
at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:938)
at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:241)
... 5 more
Caused by: java.nio.file.NoSuchFileException: /home/scorn/test_es/data/elasticsearch/nodes/0/indices/logstash-2016.05.12/3/translog/translog-21.tlog
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177)
at java.nio.channels.FileChannel.open(FileChannel.java:287)
at java.nio.channels.FileChannel.open(FileChannel.java:335)
at org.elasticsearch.index.translog.Translog.openReader(Translog.java:374)
at org.elasticsearch.index.translog.Translog.recoverFromFiles(Translog.java:334)
at org.elasticsearch.index.translog.Translog.(Translog.java:179)
at org.elasticsearch.index.engine.InternalEngine.openTranslog(InternalEngine.java:208)
at org.elasticsearch.index.engine.InternalEngine.(InternalEngine.java:151)
... 11 more
[2016-06-21 09:41:30,949][WARN ][cluster.action.shard ] [Advisor] [logstash-2016.05.12][3] received shard failed for target shard [[logstash-2016.05.12][3], node[Tgnuvv4DSTut
gGCRQpR0ig], [P], v[3], s[INITIALIZING], a[id=ekDqqz2BQEi0ur8Fi7Ow1w], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-06-21T09:41:30.116Z]]], indexUUID [HLREui0SRAO1LZt8ybnT
WA], message [failed recovery], failure [IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: No
SuchFileException[/home/scorn/test_es/data/elasticsearch/nodes/0/indices/logstash-2016.05.12/3/translog/translog-21.tlog]; ]
[logstash-2016.05.12][[logstash-2016.05.12][3]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nes
ted: NoSuchFileException[/home/scorn/test_es/data/elasticsearch/nodes/0/indices/logstash-2016.05.12/3/translog/translog-21.tlog];
at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:250)
at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:56)
at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:129)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: [logstash-2016.05.12][[logstash-2016.05.12][3]] EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/home/scorn/test_es/data/elasticse
arch/nodes/0/indices/logstash-2016.05.12/3/translog/translog-21.tlog];

Thank you :)

@s1monw
Contributor

s1monw commented Jun 21, 2016

When the issue happened I was running version 3.2.1

I assume you mean 2.3.1?

No, I cannot see the translog-20.ckp file anywhere... I had to create it manually but again it complained.

can I get a directory listing of /home/scorn/test_es/data/elasticsearch/nodes/0/indices/logstash-2016.05.12/3/translog/ or can you maybe even zip it?
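e.g. something along these lines, adjusting the path to wherever your data directory lives (the archive name is just an example):

ls -la /home/scorn/test_es/data/elasticsearch/nodes/0/indices/logstash-2016.05.12/3/translog/
tar czf shard3-translog.tar.gz -C /home/scorn/test_es/data/elasticsearch/nodes/0/indices/logstash-2016.05.12/3 translog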

@SergiuCip
Author

SergiuCip commented Jun 21, 2016

@s1monw Thank you,

Yes I meant 2.3.1 :) Sorry about that.

And I will be sending you the compressed directory through email.

Please note that the translog-21.tlog files were just touched by me. The translog directory only contained translog.ckp.

@SergiuCip
Author

@s1monw And this is what I get when I do not have the translog-21.tlog files touched:

[2016-06-21 13:15:11,257][WARN ][indices.cluster ] [Timeslip] [[logstash-2016.05.11][2]] marking and sending shard failed due to [failed recovery]
[logstash-2016.05.11][[logstash-2016.05.11][2]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nes
ted: EOFException;
at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:250)
at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:56)
at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:129)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: [logstash-2016.05.11][[logstash-2016.05.11][2]] EngineCreationFailureException[failed to create engine]; nested: EOFException;
at org.elasticsearch.index.engine.InternalEngine.(InternalEngine.java:155)
at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)
at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1509)
at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1493)
at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:966)
at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:938)
at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:241)
... 5 more
Caused by: java.io.EOFException
at org.apache.lucene.store.InputStreamDataInput.readByte(InputStreamDataInput.java:37)
at org.apache.lucene.store.DataInput.readInt(DataInput.java:101)
at org.apache.lucene.store.DataInput.readLong(DataInput.java:157)
at org.elasticsearch.index.translog.Checkpoint.(Checkpoint.java:54)
at org.elasticsearch.index.translog.Checkpoint.read(Checkpoint.java:83)
at org.elasticsearch.index.translog.Translog.readCheckpoint(Translog.java:1930)
at org.elasticsearch.index.translog.Translog.(Translog.java:163)
at org.elasticsearch.index.engine.InternalEngine.openTranslog(InternalEngine.java:208)
at org.elasticsearch.index.engine.InternalEngine.(InternalEngine.java:151)
... 11 more
[2016-06-21 13:15:11,263][WARN ][cluster.action.shard ] [Timeslip] [logstash-2016.05.11][2] received shard failed for target shard [[logstash-2016.05.11][2], node[Ja7_6NXIR8i
EzqxDnLi2Yg], [P], v[3], s[INITIALIZING], a[id=KCrRrfxmRA-O_bJcR3rp_w], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-06-21T13:15:10.560Z]]], indexUUID [HLREui0SRAO1LZt8ybn
TWA], message [failed recovery], failure [IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: E
OFException; ]
[logstash-2016.05.11][[logstash-2016.05.11][2]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nes
ted: EOFException;

@s1monw
Contributor

s1monw commented Jun 21, 2016

@s1monw And this is what I get when I do not have the translog-21.tlog files touched:

what do you mean by touched?

@SergiuCip
Author

@s1monw touch translog-21.tlog

@s1monw
Contributor

s1monw commented Jun 21, 2016

ok there are several things that don't make sense:

there is no way you get past this:

Caused by: java.io.EOFException
at org.apache.lucene.store.InputStreamDataInput.readByte(InputStreamDataInput.java:37)
at org.apache.lucene.store.DataInput.readInt(DataInput.java:101)
at org.apache.lucene.store.DataInput.readLong(DataInput.java:157)
at org.elasticsearch.index.translog.Checkpoint.(Checkpoint.java:54)
at org.elasticsearch.index.translog.Checkpoint.read(Checkpoint.java:83)
at org.elasticsearch.index.translog.Translog.readCheckpoint(Translog.java:1930)
at org.elasticsearch.index.translog.Translog.(Translog.java:163)
at org.elasticsearch.index.engine.InternalEngine.openTranslog(InternalEngine.java:208)
at org.elasticsearch.index.engine.InternalEngine.(InternalEngine.java:151)

whether you have the translog-21.tlog file or not, given that the file is really truncated, so I wonder what your magic touch does. Also, the files I got via email contain several translog-21.tlog files, but they are not empty, so there must be more to it than a touch? Is it possible to get the original files, not the ones you messed around with?

@uplexPaul

@s1monw:
The original .tlog files are all lost. As we already said, we only have the ORIGINAL translog.ckp. @SergiuCip just tried to touch the file after Elasticsearch complained that it couldn't find it. They were not there before. We may well have lost them - we don't have them - we can't send them to you, sorry.
The files you have in your archive are really only the result of "touch translog-21.tlog"

@SergiuCip
Author

SergiuCip commented Jun 21, 2016

@s1monw Yes, @uplexPaul is right: the index's translog directory contained only translog.ckp; the rest of the files inside it were gone. That is why the shard cannot recover.

So one of my attempts to 'try' and recover it was to 'touch translog-21.tlog' and see what happens.

@s1monw
Contributor

s1monw commented Jun 21, 2016

@s1monw Yes, @uplexPaul is right: the index's translog directory contained only translog.ckp; the rest of the files inside it were gone. That is why the shard cannot recover.

I understand, but the translog.ckp files you sent me are all fine - they are not broken or anything - so I wonder if I got the wrong one?

@s1monw
Contributor

s1monw commented Jun 21, 2016

you posted an exception that runs into an EOFException, and this is on a translog.ckp file. I read all of them and they are all good. Also, the translog-21.tlog files contain the translog headers, so they must have been created by Elasticsearch, hence my confusion.

@SergiuCip
Author

I have now run it again without the touched translog-21.tlog files and I get this:

Jun 21 14:34:43 log-01.novalocal elasticsearch[20591]: [2016-06-21 14:34:43,085][INFO ][gateway ] [Morph] recovered [1] indices into cluster_state
Jun 21 14:34:43 log-01.novalocal elasticsearch[20591]: [2016-06-21 14:34:43,087][INFO ][gateway ] [Morph] auto importing dangled indices [logstash-2016.05.12/OPE
[scorn@log-01 0]$ sudo journalctl -eu elasticsearch_scorn
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.elasticsearch.index.engine.InternalEngine.(InternalEngine.java:155)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1509)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1493)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:966)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:938)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:241)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: ... 5 more
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: Caused by: java.io.EOFException
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.apache.lucene.store.InputStreamDataInput.readByte(InputStreamDataInput.java:37)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.apache.lucene.store.DataInput.readInt(DataInput.java:101)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.apache.lucene.store.DataInput.readLong(DataInput.java:157)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.elasticsearch.index.translog.Checkpoint.(Checkpoint.java:54)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.elasticsearch.index.translog.Checkpoint.read(Checkpoint.java:83)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.elasticsearch.index.translog.Translog.readCheckpoint(Translog.java:1930)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.elasticsearch.index.translog.Translog.(Translog.java:163)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.elasticsearch.index.engine.InternalEngine.openTranslog(InternalEngine.java:208)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.elasticsearch.index.engine.InternalEngine.(InternalEngine.java:151)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: ... 11 more
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: [2016-06-21 14:34:44,811][WARN ][cluster.action.shard ] [Morph] [logstash-2016.05.12][1] received shard failed for targ
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: [logstash-2016.05.12][[logstash-2016.05.12][1]] IndexShardRecoveryException[failed to recovery from gateway]; nested: Engin
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:250)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:56)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:129)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at java.lang.Thread.run(Thread.java:745)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: Caused by: [logstash-2016.05.12][[logstash-2016.05.12][1]] EngineCreationFailureException[failed to create engine]; nested:
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.elasticsearch.index.engine.InternalEngine.(InternalEngine.java:155)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1509)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1493)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:966)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:938)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:241)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: ... 5 more
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: Caused by: java.io.EOFException
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.apache.lucene.store.InputStreamDataInput.readByte(InputStreamDataInput.java:37)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.apache.lucene.store.DataInput.readInt(DataInput.java:101)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.apache.lucene.store.DataInput.readLong(DataInput.java:157)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.elasticsearch.index.translog.Checkpoint.(Checkpoint.java:54)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.elasticsearch.index.translog.Checkpoint.read(Checkpoint.java:83)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.elasticsearch.index.translog.Translog.readCheckpoint(Translog.java:1930)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.elasticsearch.index.translog.Translog.(Translog.java:163)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.elasticsearch.index.engine.InternalEngine.openTranslog(InternalEngine.java:208)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.elasticsearch.index.engine.InternalEngine.(InternalEngine.java:151)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: ... 11 more
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: [2016-06-21 14:34:44,811][WARN ][indices.cluster ] [Morph] [[logstash-2016.05.12][3]] marking and sending shard fa
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: [logstash-2016.05.12][[logstash-2016.05.12][3]] IndexShardRecoveryException[failed to recovery from gateway]; nested: Engin
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:250)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:56)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:129)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
Jun 21 14:34:44 log-01.novalocal elasticsearch[20591]: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

@SergiuCip
Author

SergiuCip commented Jun 21, 2016

@s1monw And the header I added in order to test it :) was not there at the beginning; I copied it from the translog file of another index which is working.

@uplexPaul

@s1monw Just to exclude human mistakes, I re-did the same thing @SergiuCip already did: took one of the indices which don't want to recover on startup and tar-ed everything right after trying to start it ~5 minutes ago. I also included the complete logfile. Sent to you via mail.

@s1monw
Contributor

s1monw commented Jun 21, 2016

thanks guys, I am looking again

@s1monw
Contributor

s1monw commented Jun 21, 2016

@uplexPaul @SergiuCip what filesystem are you guys running on? is this a shared FS by any chance?

@uplexPaul

@s1monw We are not running a shared volume (NFS-mounted or something like that).

Our Elasticsearch runs on a VM on a VMware ESXi host in a vSphere environment with attached storage (I actually don't know how it is attached, but that shouldn't matter).

The VM runs under CentOS Linux release 7.1.1503 (Core) with the following kernel version: 3.10.0-229.20.1.el7.x86_64

@s1monw
Contributor

s1monw commented Jun 22, 2016

@uplexPaul @SergiuCip I looked into your files and the translog.ckp files are all fine. The question now is how you got into this state, and I don't have an answer to that.
Yet I am still puzzled about the content of the translog-21.tlog files you sent me, since you said they were just touched. When I touch a file it has 0 bytes, but the files I got from you have ~33 bytes, which are actually ~33 valid bytes (the header). Do you have a good explanation for how they showed up half written?
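To double-check on your side, plain shell tools are enough to see what such a file actually contains, for example:

ls -l translog-21.tlog
hexdump -C translog-21.tlog | head

A file created by a bare touch would show 0 bytes, whereas the files you sent contain the ~33-byte header mentioned above.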

@uplexPaul

@s1monw Yes, I have an answer: @SergiuCip just catted some other translog files and copied the content in there. Please just forget these files - we created them; they were not created by Elasticsearch.

It's clear to me that you are trying to find out how this issue could have happened. But shouldn't you also address a fix for the current startup behaviour? Shouldn't it be able to start up correctly? And, since you say the translog.ckp files are fine: why is Elasticsearch trying to read from "translog-21.tlog", which is apparently not there? Why is it not just skipping the translog recovery? We would really love to access the data anyway, and that should be the general goal, shouldn't it?

@SergiuCip
Author

SergiuCip commented Jun 22, 2016

@s1monw Hi! Thank you for all your support. I just want to make a few things clear:

The zipped directory I sent you was my playground; it is a snapshot taken after I had tried different things, such as:

  1. touch translog-21.tlog, as you have pointed out.
  2. because that did not work and ES was still complaining, I decided to 'cat' a translog from one of the indices which are working (after we had moved the directory out of the way and let ES create a FRESH data directory) and copied its content into the translog-21.tlog file to see if it would work. It did not work either.

Also, @uplexPaul zipped the correct directory, with no translog-xx.tlog files in it because there aren't any, so it represents the state we actually have right now: the index directory contains a translog directory which holds ONLY a single file, translog.ckp.

The problem is that we did the following:

  1. Stopped ES
  2. cp logstash-2016-05-12 /elasticsearch/main_directory/data/
  3. Start ES
  4. Get lots of ERROR messages saying that the translog-21.tlog is missing

Therefore, we upgraded ES from 2.3.1 to 2.3.3, thinking that if it is a bug it might have been fixed in the newest version. However, this was not the case.

Every time we start ES with these indices it keeps complaining and /var/log/elasticsearch/elasticsearch.log becomes huge.

I would suggest you look at the files sent by @uplexPaul, as they are the REAL files from when the ISSUE OCCURRED.

I also agree with @uplexPaul: why does it keep searching for and trying to read translog-21.tlog? I would expect it to give up after at most 3 retries and just start without that information. Moreover, from my research I could not see any .recovery file in the index itself, so it should not even try to recover it; am I right, or am I missing something?

@s1monw
Contributor

s1monw commented Jun 22, 2016

Shouldn't it be able to start up correctly?

well, there is a file missing that prevents the startup. I can't just ignore that and start up the shard; it's a bug and potential data loss. I can't tell whether this happened due to some issue on your end or due to the power loss, but I have been doing nothing but looking into this for 2 days now. I want to know what it is so I can add some code to prevent the issue and start up the shard.

@SergiuCip
Author

@s1monw Thank you very much for all your support, it is much appreciated. Please let me know if you find anything :)

@SergiuCip
Author

SergiuCip commented Jun 23, 2016

@s1monw I have one question, if I may: does Elasticsearch have a database of some sort to remember the indices? I can see that some of them have been restored on my ES server (because, as I said, after the problem occurred we started with a fresh 'data' directory).

Thank you :)

@SergiuCip
Author

@s1monw Hi :) have you found something useful so far?

Thanks

@s1monw
Contributor

s1monw commented Jun 27, 2016

@s1monw Hi :) have you found something useful so far?

I have not found anything that explains the loss of your translog files, sorry.

@uplexPaul

@s1monw And can you help us restore that data? Shouldn't there be some way to skip translog recovery? Am I not right that we still have a bunch of useful data, given that the index itself is not corrupt? We would really love to access that data. Currently we can't :(

@s1monw
Contributor

s1monw commented Jun 27, 2016

@s1monw And can you help us restore that data? Shouldn't there be some way to skip translog recovery? Am I not right that we still have a bunch of useful data, given that the index itself is not corrupt? We would really love to access that data. Currently we can't :(

I agree, but we don't have a good way to do this. Your index IS corrupt: you are missing a translog file. I agree this is largely theoretical, but I won't add any code that skips this, sorry. We can't do this; I can't just add code that ignores a bug in the system. We have done this in the past and every time we do it, it comes back to bite us: it means folks won't report problems, they just call APIs and continue. Long term it's not maintainable. I hope we find a good way to help in these situations, but currently there is no good way.

@uplexPaul

@s1monw I totally understand that you don't want to work around the bug. But wouldn't it be possible to create a little tool for restoring the data that we could use, or something like that?
Right now we are in a very bad situation because no one knows how it crashed - but again, in my mind the user (me in this case) should at least have the option to access / restore the portion of usable data which is still there in the (non-corrupt) Lucene index.

@clintongormley

But wouldn't it be possible to create a little tool for restoring the data that we could use, or something like that?

Yes agreed. We plan to add a command line tool which will allow for low-level, expert only, you-will-lose-data actions like this one: replacing a damaged transaction log with a fresh one that allows the data in the index to be recovered.

Unfortunately, this tool won't be available in time to be useful for you I think. For now, your best bet is to reindex.

@uplexPaul

@clintongormley We cannot "reindex" (I think by reindex you mean re-importing the data into a new Elasticsearch index) - because the source data is lost. Can you point us to the lines in the source code that would be a good starting point for implementing such a startup ourselves?

@clintongormley

In 5.0 we have a tool which can be used to truncate corrupt translog files. This doesn't exist in 2.x but there is a workaround:

POST my_index/_close

PUT my_index/_settings
{ "index.engine.force_new_translog": true }

POST my_index/_open

PUT my_index/_settings
{ "index.engine.force_new_translog": false }

NOTE: Any data in the corrupted translog will be lost.
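For example, the sequence above expressed as curl commands (host, port and index name are placeholders for your own setup):

curl -XPOST 'localhost:9200/my_index/_close'
curl -XPUT 'localhost:9200/my_index/_settings' -d '{ "index.engine.force_new_translog": true }'
curl -XPOST 'localhost:9200/my_index/_open'
curl -XPUT 'localhost:9200/my_index/_settings' -d '{ "index.engine.force_new_translog": false }'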

@mtarantino

Hi @clintongormley,

I tried the workaround from your last comment on 2.3.5 and ran into some issues with that procedure:

PUT my_index/_settings { "index.engine.force_new_translog": false } is not allowed while the index is open:

Can't update non dynamic settings[[index.engine.force_new_translog]] for open indices [[my_index]]"

So I think the procedure would be:

POST my_index/_close
PUT my_index/_settings { "index.engine.force_new_translog": true }
POST my_index/_open
POST my_index/_close
PUT my_index/_settings { "index.engine.force_new_translog": false }
POST my_index/_open

With this procedure it recreates a new translog (translog-1.tlog, translog.ckp), but after the last _open it is not able to open the shards because "this translog file belongs to a different translog":

[2016-08-19 22:40:42,535][WARN ][indices.cluster          ] [Hyperion] [[my_index][2]] marking and sending shard failed due to [failed recovery]
[my_index][[my_index][2]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: TranslogCorruptedException[expected shard UUID [[4d 78 6a 58 72 41 65 43 53 68 53 49 78 66 42 51 51 62 45 76 45 77]] but got: [[55 61 6a 72 50 66 78 34 51 49 43 75 38 57 75 46 7a 44 55 39 4d 67]] this translog file belongs to a different translog];
    at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:250)
    at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:56)
    at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:129)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: [my_index][[my_index][2]] EngineCreationFailureException[failed to create engine]; nested: TranslogCorruptedException[expected shard UUID [[4d 78 6a 58 72 41 65 43 53 68 53 49 78 66 42 51 51 62 45 76 45 77]] but got: [[55 61 6a 72 50 66 78 34 51 49 43 75 38 57 75 46 7a 44 55 39 4d 67]] this translog file belongs to a different translog];
    at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:155)
    at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)
    at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1509)
    at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1493)
    at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:966)
    at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:938)
    at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:241)
    ... 5 more
Caused by: TranslogCorruptedException[expected shard UUID [[4d 78 6a 58 72 41 65 43 53 68 53 49 78 66 42 51 51 62 45 76 45 77]] but got: [[55 61 6a 72 50 66 78 34 51 49 43 75 38 57 75 46 7a 44 55 39 4d 67]] this translog file belongs to a different translog]
    at org.elasticsearch.index.translog.TranslogReader.open(TranslogReader.java:235)
    at org.elasticsearch.index.translog.Translog.openReader(Translog.java:377)
    at org.elasticsearch.index.translog.Translog.recoverFromFiles(Translog.java:334)
    at org.elasticsearch.index.translog.Translog.<init>(Translog.java:179)
    at org.elasticsearch.index.engine.InternalEngine.openTranslog(InternalEngine.java:208)
    at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:151)
    ... 11 more

The workaround looks promising and will help lots of people in the same situation (!!!!) who do not have the possibility to migrate to 5.x.
Could we still try to:

Thanks a lot :)

@clintongormley

Hi @mtarantino

I've managed to replicate this. I think your only solution for now is to use the index.engine.force_new_translog setting to open your index, then reindex the data into a new index and delete the old one.
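For example, roughly (index names, host and port are placeholders; this assumes the reindex API available from 2.3 onwards):

curl -XPOST 'localhost:9200/_reindex' -d '{
  "source": { "index": "my_index" },
  "dest":   { "index": "my_index_restored" }
}'
curl -XDELETE 'localhost:9200/my_index'

Check that the new index contains the documents you expect before deleting the old one.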
