
Snapshot/delete operations are failing with snapshot_missing_exception #104598

Closed
pratik-devops opened this issue Jan 22, 2024 · 1 comment
Labels: >bug, needs:triage

Comments

@pratik-devops

Elasticsearch Version

7.16.3

Installed Plugins

No response

Java Version

bundled

OS Version

Linux elasticsearch-7c459c9bc5-l88zr 3.10.0-1160.76.1.el7.x86_64 #1 SMP Tue Jul 26 14:15:37 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Problem Description

We have a single-node Elasticsearch cluster hosting around 15 aliases, and it has been working perfectly. We developed a backup and restore solution using Curator actions, which had also been working perfectly fine.

The solution works as follows:

  1. Take a snapshot of the Elasticsearch indices, then use the delete action to retain only the latest 2 snapshots, using the action file below:

actions:
  1:
    action: snapshot
    description: >-
      Snapshot selected aliases to 'aaf_backup' with the snapshot name
      pattern in 'name'.
    options:
      repository: aaf_backup
      name: curator_snapshot-%Y%m%d%H%M%S
      wait_for_completion: True
      ignore_unavailable: True
      include_global_state: True
      max_wait: 3600
      wait_interval: 10
    filters:
    - filtertype: alias
      aliases:
      - test1
      - test2
      - testn
      exclude: False
  2:
    action: delete_snapshots
    description: "Delete snapshots from 'aaf_backup' repository (exclude latest 2)."
    options:
      repository: aaf_backup
      ignore_empty_list: True
      retry_interval: 120
      retry_count: 3
    filters:
    - filtertype: pattern
      kind: prefix
      value: curator_
    - filtertype: count
      count: 2

  2. In case of disaster, we will use a restore action to restore from the latest snapshot (a sketch of such an action follows).
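For illustration, here is a minimal sketch of what such a restore action could look like with Curator 5.x. This is not the actual file from this report: the option values are assumptions, and per the Curator documentation, leaving name blank restores the most recent snapshot matched by the filters.

actions:
  1:
    action: restore
    description: >-
      Restore the most recent 'curator_' snapshot from 'aaf_backup'.
    options:
      repository: aaf_backup
      # Blank 'name' tells Curator to restore the most recent matching snapshot
      name:
      wait_for_completion: True
      max_wait: 3600
      wait_interval: 10
    filters:
    - filtertype: pattern
      kind: prefix
      value: curator_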

Suddenly, we started seeing failures in the snapshot operation, as below:

2024-01-19 08:35:05,547 INFO Snapshot curator_snapshot-20240119083505 still in progress.
2024-01-19 08:35:15,562 WARNING Snapshot curator_snapshot-20240119083505 completed with state PARTIAL.
2024-01-19 08:35:15,565 ERROR Snapshot PARTIAL completed with state: PARTIAL
2024-01-19 08:35:15,565 ERROR Failed to complete action: snapshot. <class 'curator.exceptions.FailedExecution'>: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: Snapshot PARTIAL completed with state: PARTIAL

On checking the snapshot repo with the curl command below:

curl -k https://localhost:30920/_snapshot/aaf_backup/_all?pretty
{
  "error" : {
    "root_cause" : [
      {
        "type" : "snapshot_missing_exception",
        "reason" : "[aaf_backup:curator_snapshot-20230821110005/F4kYVIy9Re6ea90ikPVNrw] is missing"
      }
    ],
    "type" : "snapshot_missing_exception",
    "reason" : "[aaf_backup:curator_snapshot-20230821110005/F4kYVIy9Re6ea90ikPVNrw] is missing",
    "caused_by" : {
      "type" : "no_such_file_exception",
      "reason" : "/usr/share/elasticsearch/data/mount/backups/snap-F4kYVIy9Re6ea90ikPVNrw.dat"
    }
  },
  "status" : 404
}
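A hedged diagnostic aside: in 7.x the GET _snapshot API accepts an ignore_unavailable query parameter, so a listing like the one below may still succeed while one snapshot's metadata blob is missing. This only skips the broken entry; it does not repair the repository.

# Skip snapshots whose metadata is missing instead of failing the whole request
curl -k "https://localhost:30920/_snapshot/aaf_backup/_all?pretty&ignore_unavailable=true"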

$ curl -k https://localhost:30920/_cluster/health?pretty
{
  "cluster_name" : "test_elasticsearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 119,
  "active_shards" : 119,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

There is a workaround to get the snapshot operation working again: deregister the repository, delete the data from the backup mount, then recreate the backup repository and register it. After this it resumes business-as-usual work, but after a certain interval it ends with the same behavior, and the snapshot operation starts failing again.
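Concretely, the workaround amounts to something like the following sketch. The repository location is inferred from the NoSuchFileException paths in the logs, and the 'fs' type and settings are assumptions, since the original registration command is not shown in this issue.

# 1. Deregister the repository (this does not delete the underlying files)
curl -k -X DELETE "https://localhost:30920/_snapshot/aaf_backup"

# 2. Wipe the backup mount (path inferred from the error logs above)
rm -rf /usr/share/elasticsearch/data/mount/backups/*

# 3. Re-register the repository ('fs' type and 'location' are assumptions)
curl -k -X PUT "https://localhost:30920/_snapshot/aaf_backup" \
  -H 'Content-Type: application/json' \
  -d '{"type": "fs", "settings": {"location": "/usr/share/elasticsearch/data/mount/backups"}}'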

Expected Behavior

We run the snapshot-creation and old-snapshot-deletion actions described above every 6 hours, every day. The expectation is that Curator keeps taking a new snapshot and retains only the latest two snapshots in the repository.
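For context, the 6-hour schedule is driven by an external scheduler; a hypothetical cron entry (the binary and file paths below are illustrative assumptions, not the actual setup) would be:

# Run the snapshot + delete_snapshots action file every 6 hours
0 */6 * * * /usr/local/bin/curator --config /etc/curator/curator.yml /etc/curator/actions.yml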

Actual Behavior

The snapshot process described above works for a while, then at some point it starts failing with a partial snapshot error:
Failed to complete action: snapshot. <class 'curator.exceptions.FailedExecution'>: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: Snapshot PARTIAL completed with state: PARTIAL
Error in ES logs:

"stacktrace": ["java.nio.file.NoSuchFileException: /usr/share/elasticsearch/data/mount/backups/indices/87gNXnEESjCTaTFJqekEuw/0/index-2Q_3S0aYSCubJ39gsp7L9w",
"at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]",
"at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]",
"at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]",
"at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:219) ~[?:?]",
"at java.nio.file.Files.newByteChannel(Files.java:380) ~[?:?]",
"at java.nio.file.Files.newByteChannel(Files.java:432) ~[?:?]",
"at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:422) ~[?:?]",
"at java.nio.file.Files.newInputStream(Files.java:160) ~[?:?]",
"at org.elasticsearch.common.blobstore.fs.FsBlobContainer.readBlob(FsBlobContainer.java:195) ~[elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.read(ChecksumBlobStoreFormat.java:88) ~[elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.repositories.blobstore.BlobStoreRepository.buildBlobStoreIndexShardSnapshots(BlobStoreRepository.java:3416) ~[elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:2666) [elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:370) [elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewShards$1(SnapshotShardsService.java:267) [elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.16.3.jar:7.16.3]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]",
"at java.lang.Thread.run(Thread.java:833) [?:?]"] }
{"type": "server", "timestamp": "2024-01-19T08:35:06,443Z", "level": "WARN", "component": "o.e.s.SnapshotShardsService", "cluster.name": "test_elasticsearch", "node.name": "test_node", "message": "[[test_index-2023.04.24-000002][0]][aaf_backup:curator_snapshot-20240119083505/Lj_q_BZTQqy9pxiEnbu2Mg] failed to snapshot shard", "cluster.uuid": "lp-MLx_wRN-UeC7oEFTxtw", "node.id": "luXNaXPhSJS2cuNANwi_HQ" ,

As per the above logs, it seems Elasticsearch was not able to find an index-related file in the repository. But this is happening in production, where the files were never removed manually, so I suspect something else is going wrong here.

Specifications

ES version: 7.16.3
Curator version: 5.8.4

Context (Environment)

This is causing issues in production, where we end up with a corrupted backup repository, and the only solution we have is to recreate it, losing all the backups. In the worst case the backups will be of no use, as there will be no way to restore the data because of the repository corruption.

Adding link of issue raised for curator: elastic/curator#1697

Steps to Reproduce

All the details are mentioned above.

Logs (if relevant)

The errors found in the Elasticsearch logs are the same as those quoted under Actual Behavior above.

@pratik-devops added the >bug and needs:triage labels on Jan 22, 2024
@DaveCTurner (Contributor)

Thank you very much for your interest in Elasticsearch. Unfortunately the issue you have reported relates to Elasticsearch version 7.16.3, which is very old and has passed end-of-life. We will not investigate issues related to unsupported versions here on GitHub, so I am closing this to indicate that no action is needed from the Elasticsearch development team. It's possible that you will find a volunteer to help you with this issue on the community forums, but our strong recommendation would be to upgrade to a supported version of Elasticsearch as a matter of some urgency. If you can reproduce your issue on a supported version then please open a fresh bug report.

Quoting the bug report form:

Please also check your OS is supported, and that the version of Elasticsearch has not passed end-of-life. If you are using an unsupported OS or an unsupported version then the issue is likely to be closed.

@DaveCTurner closed this as not planned on Jan 22, 2024