Corrupt index after disk full, unclean shutdown / Possible to avoid translog recovery at all? #18972
Comments
@SergiuCip When you say:
do you mean transaction logs or files in the …
Hi. I am working with @SergiuCip on this issue, @clintongormley. But anyway: the main INDEX itself is intact (checked with Lucene). So even though we no longer have the translog files for these indices, they SHOULD be able to recover without the translog as long as the index is not corrupted. That is what we would like to achieve.
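How one might run the "checked with Lucene" step mentioned above: a minimal sketch using Lucene's CheckIndex, which ships in the lucene-core jar inside every Elasticsearch install. The class name and path handling below are illustrative, not the exact commands the posters ran; point it at a shard's Lucene directory, e.g. .../nodes/0/indices/<index>/<shard>/index.

import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class VerifyShardIndex {
    public static void main(String[] args) throws Exception {
        // args[0]: path to a shard's Lucene index directory
        try (FSDirectory dir = FSDirectory.open(Paths.get(args[0]));
             CheckIndex check = new CheckIndex(dir)) {
            CheckIndex.Status status = check.checkIndex();
            // status.clean is true only if every segment passed verification
            System.out.println(status.clean ? "index is intact" : "index is corrupt");
        }
    }
}

A clean result here is exactly the situation described in this thread: the Lucene index is fine, and only the translog generation is missing.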
@clintongormley: I am sorry for being ambiguous :) What I meant was that we deleted some logs from /var/log/elasticsearch because they were growing at a fast rate due to the recovery process. As @uplexPaul mentioned, we do have the translog.ckp but not the translog files themselves. When you try to start Elasticsearch with those indices, it complains that it cannot start, for the reasons I posted in my first post. Any help is appreciated. Thanks
Sorry to ask, but are you sure you're on 2.3.3? On all nodes? We had a bug in early versions of 2.x which ended up with corrupt transaction logs after disk-full, but I haven't heard reports of this since 2.3
and one more clarification - you say you do have the translog.ckp file, but the exception indicates differently: …
@clintongormley Hi, and thanks for the update. Yes, I beg your pardon: we were running ES version 2.3.1 when the issue occurred, and we then updated to 2.3.3. @bleskes: So, we have tried everything... keeping the .ckp file and trying to start ES, but no luck; I have also tried deleting the .ckp file and then starting ES, but no luck either.
@SergiuCip I can see some EOF exceptions, but I am missing the part where it says which file hits EOF. Is it possible to get the full exceptions? Do you also have the log from when this thing went disk-full? It would be great to get all of the logs and exception traces.
@s1monw I will try to see if I can find where it says EOF in the log. However, as for the log from when the disk started to become full, I am afraid I don't have that anymore, as that was the first log we started truncating because of its fast growth. I will do my best to provide you with as much logging as possible ;)
Good morning everyone! Thank you for all the support along the way! As promised, I have run the test again in order to have a detailed log, and here it is: https://gist.github.com/SergiuCip/86dc26c3f67a526d7e4abc060c327eb4 Also please find attached the .ckp file for this log.
The EOF error occurs when I manually touch the translog-21.tlog file: https://gist.github.com/SergiuCip/984f26e1ac6cae562b35a48be7a27164
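Why a hand-touched, zero-byte translog-21.tlog produces an immediate EOF: the translog reader first expects a Lucene-style codec header at the front of the file, and an empty file ends before even the 4-byte magic can be read. A hedged sketch of that check, assuming 2.x writes CodecUtil headers with the codec name "translog" (verify against the TranslogWriter source; the class below is illustrative):

import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.InputStreamDataInput;
import java.io.EOFException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class TranslogHeaderCheck {
    public static void main(String[] args) throws Exception {
        // args[0]: path to a translog-N.tlog file
        try (InputStreamDataInput in =
                 new InputStreamDataInput(Files.newInputStream(Paths.get(args[0])))) {
            // checkHeader reads the magic int, the codec name, and the version
            int version = CodecUtil.checkHeader(in, "translog", 1, Integer.MAX_VALUE);
            System.out.println("valid translog header, version " + version);
        } catch (EOFException e) {
            System.out.println("file is empty or truncated before the header: " + e);
        }
    }
}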
@SergiuCip thanks for all the files and infos, lemme ask some more questions: …
thanks!
@s1monw You're welcome and thank you for helping me debug it.
[logstash-2016.05.12][[logstash-2016.05.12][3]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nes…
Thank you :)
I assume you mean 2.3.1?
can I get a directory listing of …
@s1monw Thank you. Yes, I meant 2.3.1 :) Sorry about that. And I will be sending you the compressed directory through email. Please note that the translog-21.tlog files were just touched (created empty) by me; the translog directory only contained translog.ckp.
@s1monw And this is what I get when I do not have the translog-21.tlog files touched: [2016-06-21 13:15:11,257][WARN ][indices.cluster ] [Timeslip] [[logstash-2016.05.11][2]] marking and sending shard failed due to [failed recovery]
what do you mean by touched?
@s1monw touch translog-21.tlog
ok there are several things that don't make sense: there is no way you get past this: …
if you have the …
@s1monw: …
@s1monw Yes, @uplexPaul is right: the index translog directory contained only translog.ckp; the rest of the files inside the translog directory were gone. That is why it cannot recover. So one of my attempts to 'try' and recover it was to 'touch translog-21.tlog' and see what happens.
I understand, but the translog.ckp files you sent me are all fine - they are not broken or anything, so I wonder if I got the wrong one?
you posted an exception that runs into …
I have now run the same without the translog-21.tlog files touched and I get this: Jun 21 14:34:43 log-01.novalocal elasticsearch[20591]: [2016-06-21 14:34:43,085][INFO ][gateway ] [Morph] recovered [1] indices into cluster_state
@s1monw And the header I added in order to test it :) was not there at the beginning; I copied it from another index's translog file which is working
@s1monw Just to exclude human mistakes, I re-did the same as @SergiuCip already did: took one of these indices which doesn't want to recover on startup and tar-ed everything just as I tried to start it ~5 mins ago. Also included the complete logfile. Sent to you via mail.
thanks guys, I am looking again
@uplexPaul @SergiuCip what filesystem are you guys running on? Is this a shared FS by any chance?
@s1monw We are not running a shared volume (NFS-mounted or anything like that). Our Elasticsearch is on a VM running on a VMware ESXi host in a vSphere environment with attached storage (I actually don't know how it is attached, but it shouldn't matter). The VM runs CentOS Linux release 7.1.1503 (Core) with kernel version 3.10.0-229.20.1.el7.x86_64.
@uplexPaul @SergiuCip I looked into your files and the …
@s1monw Yes, I have an answer: @SergiuCip just catted some other translog files and copied the contents there. Just forget these files, please - we created them; they were not created by Elasticsearch. It's clear to me that you are trying to find out how this issue could have happened. But shouldn't you also address a fix for the current startup behaviour? Shouldn't it be able to start up correctly? And - when you say the translog.ckp files are fine - why is Elasticsearch trying to read from "translog-21.tlog", which is apparently not there? Why is it not just skipping this translog recovery? We would really love to access the data anyway, and that should be the general goal, shouldn't it?
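On the question of why Elasticsearch keeps looking for translog-21.tlog: the surviving translog.ckp records which translog generation must exist, so recovery asks for exactly that file and fails when it is gone. A hedged sketch of decoding the checkpoint, assuming the 2.x layout of offset (long), numOps (int), and generation (long) in big-endian order - verify against the Checkpoint class in the 2.x source before relying on it:

import java.io.DataInputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadCheckpoint {
    public static void main(String[] args) throws Exception {
        // args[0]: path to translog.ckp
        try (DataInputStream in =
                 new DataInputStream(Files.newInputStream(Paths.get(args[0])))) {
            long offset = in.readLong();     // bytes written to the current generation
            int numOps = in.readInt();       // operations recorded in that generation
            long generation = in.readLong(); // which translog-N.tlog must exist
            System.out.printf("offset=%d numOps=%d generation=%d -> recovery expects translog-%d.tlog%n",
                    offset, numOps, generation, generation);
        }
    }
}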
@s1monw Hi! Thank you for all your support, and I just want to make a few things clear: the zipped directory that I sent you was my playground, and it is a snapshot from after I had tried different things such as: …
Also, @uplexPaul zipped the correct directory, with no translog-xx.tlog files in it because there aren't any, so it represents the status we actually have right now: the index directory, which contains the translog directory, which holds ONLY a single file: translog.ckp. The problem is that we had: …
Therefore we upgraded ES from 2.3.1 to 2.3.3, as we thought that if it is a bug, it might have been fixed in the newer version. However, this was not the case. Every time we start ES with these indices it keeps complaining, and /var/log/elasticsearch/elasticsearch.log becomes huge. I would suggest you look at the files sent by @uplexPaul, as they are the REAL files from when the ISSUE OCCURRED. I also agree with @uplexPaul: why does it keep searching for and trying to read from translog-21.tlog? I would guess that after at most 3 retries it should give up and just start without that information. Moreover, from my research I could not see any .recovery file in the index itself, so it should not even try to recover it - am I right, or am I missing something?
well, there is a file missing that prevents the startup. I can't just ignore that and start up the shard: it's a bug and potential data loss. I can't tell if this happened due to some issue on your end or due to the power loss, but I have been doing nothing other than looking into this for 2 days now. I wanna know what it is so I can add some code to prevent the issue and start up the shard.
@s1monw Thank you very much for all your support, it is much appreciated. Please let me know if you find anything :)
@s1monw I have one question, if I may: does Elasticsearch have a database of some sort to remember the indices? I ask because I can see on my ES server that some of them are restored (because, as I have said, after the problem occurred we started with a fresh 'data' directory). Thank you :)
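On the "database of indices" question: Elasticsearch persists the cluster state and per-index metadata as files under the data path, which is how a node remembers its indices across restarts. A hedged sketch, assuming the 2.x directory layout (global state under nodes/<n>/_state, per-index state under each index's own _state directory - verify against your install), that lists those state files:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class ListStateFiles {
    public static void main(String[] args) throws Exception {
        // args[0]: a node data directory, e.g. /var/lib/elasticsearch/<cluster>/nodes/0
        Path dataPath = Paths.get(args[0]);
        try (Stream<Path> paths = Files.walk(dataPath)) {
            // print every file that lives directly inside a "_state" directory
            paths.filter(p -> p.getParent() != null
                           && p.getParent().getFileName().toString().equals("_state"))
                 .forEach(System.out::println);
        }
    }
}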
@s1monw Hi :) have you found something useful so far? Thanks
I have not found anything that indicates the cause of the loss of your translog files, sorry.
@s1monw And can you help us restore that data? Shouldn't there be some way to skip translog recovery? Am I not right that we have a bunch of useful data when the index itself is not corrupt? We would really love to access that data; currently we can't :(
I agree, but we don't have a good way to do this. Your index IS corrupt: you are missing a translog file. I agree this is theoretical, but I won't add any code that skips this, sorry. We can't do this; I can't just add code that ignores a bug in the system. We have done this in the past, and every time we do it, it comes back to bite us. It means folks won't report problems; they just call APIs and continue. Long term it's not maintainable. I hope we find a good way to help in these situations, but currently there is no good way.
@s1monw I totally understand that you don't want to work around the bug. But wouldn't it maybe be possible to create a little tool for restoring the data, or something like that?
Yes, agreed. We plan to add a command-line tool which will allow for low-level, expert-only, you-will-lose-data actions like this one: replacing a damaged transaction log with a fresh one, which allows the data in the index to be recovered. Unfortunately, this tool won't be available in time to be useful for you, I think. For now, your best bet is to reindex.
@clintongormley We cannot "reindex" (I think by reindex you mean re-importing the data into a new Elasticsearch index), because the data is lost. Can you point us to the lines in the source that would be a good starting point for us to implement such a startup ourselves?
In 5.0 we have a tool which can be used to truncate corrupt translog files. This doesn't exist in 2.x, but there is a workaround: …
NOTE: Any data in the corrupted translog will be lost.
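The 2.x workaround referred to above (and what the 5.x tool, later shipped as bin/elasticsearch-translog truncate, automates) amounts to recreating an empty translog generation that matches what translog.ckp and the Lucene commit expect. Below is a heavily hedged illustration of that idea, not a supported procedure: the on-disk layout assumed here - a CodecUtil header with codec "translog" version 2 plus a length-prefixed translog UUID in the .tlog, and offset/numOps/generation in the .ckp - must be verified against the 2.x source, the UUID must be copied from the shard's commit user data, and any operations that lived in the lost translog are gone for good.

import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.OutputStreamDataOutput;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class EmptyTranslog {
    public static void main(String[] args) throws Exception {
        Path translogDir = Paths.get(args[0]);     // .../indices/<index>/<shard>/translog
        String uuid = args[1];                     // translog UUID from the Lucene commit user data
        long generation = Long.parseLong(args[2]); // e.g. 21 for translog-21.tlog

        // Write a fresh, empty translog generation: header + UUID, no operations.
        Path tlog = translogDir.resolve("translog-" + generation + ".tlog");
        try (OutputStream os = Files.newOutputStream(tlog, StandardOpenOption.CREATE_NEW)) {
            OutputStreamDataOutput out = new OutputStreamDataOutput(os);
            CodecUtil.writeHeader(out, "translog", 2); // assumed codec name/version of the 2.x writer
            byte[] uuidBytes = uuid.getBytes(StandardCharsets.UTF_8);
            out.writeInt(uuidBytes.length);
            out.writeBytes(uuidBytes, 0, uuidBytes.length);
        }

        // Point the checkpoint at the new generation: nothing beyond the header, zero ops.
        long headerLength = Files.size(tlog);
        Path ckp = translogDir.resolve("translog.ckp");
        try (OutputStream os = Files.newOutputStream(ckp,
                StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {
            OutputStreamDataOutput out = new OutputStreamDataOutput(os);
            out.writeLong(headerLength);
            out.writeInt(0);
            out.writeLong(generation);
        }
    }
}

Whether a 2.3 shard actually accepts a hand-built generation like this is exactly what the procedure discussed in the next comments probes; test on a copy of the shard, never on the only copy of the data.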
Hi @clintongormley, I tried the workaround from your last comment on a 2.3.5 and ran into some issues with that procedure: …
So I think the procedure would be: …
With this procedure, it recreates a new translog (…)
The workaround looks promising and will help lots of people in the same situation(!!!!) who do not have the possibility to migrate to 5.X.
Thanks a lot :)
Hi @mtarantino I've managed to replicate this. I think your only solution for now is to use the …
Elasticsearch version: 2.3.1; after trying the steps below we decided to update to the newest version, in the hope that the bug would have been fixed there
JVM version: openjdk version "1.8.0_91"
OpenJDK Runtime Environment (build 1.8.0_91-b14)
OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
OS version: CentOS 7
Description of the problem including expected versus actual behavior:
The disk space on our Elasticsearch server got full and we had to remove some logs in order to reduce disk usage. We then restarted Elasticsearch, and as soon as we had done so we saw that some of our indices were broken (please see the logs attached below).
We moved the entire Elasticsearch 'data' directory out of the way, since it was urgent to have logging back, and restarted Elasticsearch, which started with a fresh 'data' directory.
In order to restore the indices, we copied some of them to another Elasticsearch server, because we did not want to do this on our live Elasticsearch server, and we did as follows: …
also: …
Error message:
[2016-06-20 09:01:56,764][WARN ][cluster.action.shard ] [Famine] [logstash-2016.05.11][3] received shard failed for target shard [[logstash-2016.05.11][3], node[d-z_xFoqTrGd-kip-Tly_w], [P], v[3], s[INITIALIZING], a[id=yh9ngGZNQ5CIrq87hxZRTg], unassigned_info[[reason=ALLOCATION_FAILED], at[2016-06-20T09:01:56.636Z], details[failed recovery, failure IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: EOFException; ]]], indexUUID [xN9jy5ZiRqaALYvkDxdFgA], message [failed recovery], failure [IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: EOFException; ]
Jun 20 09:01:49 log-01.novalocal elasticsearch[9894]: Caused by: [logstash-2016.05.11][[logstash-2016.05.11][0]] EngineCreationFailureException[failed to create engine]…
Jun 20 09:01:49 log-01.novalocal elasticsearch[9894]: at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:155)
Jun 20 09:01:49 log-01.novalocal elasticsearch[9894]: at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)
Jun 20 09:01:49 log-01.novalocal elasticsearch[9894]: at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1509)
Jun 20 09:01:49 log-01.novalocal elasticsearch[9894]: at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1493)
Jun 20 09:01:49 log-01.novalocal elasticsearch[9894]: at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:966)
Jun 20 09:01:49 log-01.novalocal elasticsearch[9894]: at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:938)
[2016-06-20 07:12:24,538][WARN ][cluster.action.shard ] [Brain-Child] [logstash-2016.05.11][3] received shard failed for [logstash-2016.05.11][3], node[FcS6VgojTuijenAeemDE6Q], [P], v[3], s[INITIALIZING], a[id=OdCh5n8GSDeZpiCLW-CKZg], unassigned_info[[reason=ALLOCATION_FAILED], at[2016-06-20T07:12:24.364Z], details[failed recovery, failure IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: EOFException; ]], indexUUID [xN9jy5ZiRqaALYvkDxdFgA], message [failed recovery], failure [IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: EOFException; ]
[logstash-2016.05.11][[logstash-2016.05.11][3]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: EOFException;
Caused by: [logstash-2016.05.11][[logstash-2016.05.11][4]] EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/var/lib/elasticsearch/elasticsearch/nodes/0/indices/logstash-2016.05.11/4/translog/translog.ckp];
at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:155)
at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)
at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1509)
at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1493)
at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:966)
at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:938)
at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:241)
Could someone please tell me what I have done wrong and what I could do to restore my indices?
Thanks,
Sergiu