When CCR disallow auto follow leader system indices should show error #81238

Open
Leaf-Lin opened this issue Dec 2, 2021 · 3 comments
Labels: >bug, :Distributed/CCR (Issues around the Cross Cluster State Replication features), Team:Distributed (Meta label for distributed team)

Comments

@Leaf-Lin
Contributor

Leaf-Lin commented Dec 2, 2021

Elasticsearch version (bin/elasticsearch --version): 8.0.0-beta1

Plugins installed: []

JVM version (java -version):

OS version (uname -a if on a Unix-like system):

Description of the problem including expected versus actual behavior:
#72815 disallows CCR from auto-following leader system indices.
This is fine, except that one can still use the API or the Kibana UI to create an auto-follow pattern that contains system indices.

Internally, Elasticsearch prevents system indices from being replicated, but neither the API nor the UI returns an error message, so users cannot figure out why the system indices were not replicated.

Steps to reproduce:

  1. Create two deployments, clusterA and clusterB, both on 8.0.0.
  2. Set up CCR so that clusterA can access remote clusterB.
  3. On clusterA, delete .kibana_8.0.0_001.
  4. Create an auto_follow pattern on clusterA that specifies leader system indices from clusterB:
PUT /_ccr/auto_follow/kibana
{
  "remote_cluster": "clusterB",
  "leader_index_patterns": [
    ".kibana_8.0.0_001"
  ],
  "leader_index_exclusion_patterns": [],
  "follow_index_pattern": "{{leader_index}}"
}

--> Expected behavior: an error message should appear saying that .kibana_8.0.0_001 is a system index, and the operation should not succeed.
--> Actual behavior: it returns the following, as if the request had been accepted.

{
  "acknowledged" : true
}
@Leaf-Lin Leaf-Lin added :Distributed/CCR Issues around the Cross Cluster State Replication features Team:Distributed Meta label for distributed team needs:triage Requires assignment of a team area label labels Dec 2, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@Leaf-Lin Leaf-Lin added >bug and removed needs:triage Requires assignment of a team area label labels Dec 2, 2021
@Leaf-Lin Leaf-Lin added the good first issue low hanging fruit label May 12, 2022
@Leaf-Lin
Contributor Author

Leaf-Lin commented Jun 7, 2022

I just encountered the same issue for a different category of indices: CCR did not report an error when I set up auto_follow for searchable snapshot indices.

Prerequisite

  1. First set up two clusters. On the leader cluster, configure hot, cold and frozen tiers. On the follower cluster, configure a hot tier.
  2. Set up an ILM policy on the leader cluster so that the index transitions from hot to cold to frozen.
### On the leader cluster
PUT _ilm/policy/timeseries_policy
{
  "policy": {
    "phases": {
      "hot": {                                
        "actions": {
          "rollover": {
            "max_age": "1m"
          }
        }
      },
      "cold": {
        "min_age": "5m",         
        "actions": {
          "searchable_snapshot" : {
            "snapshot_repository" : "found-snapshots"
          }                        
        }
      },
      "frozen": {
        "min_age": "10m",         
        "actions": {
          "searchable_snapshot" : {
            "snapshot_repository" : "found-snapshots"
          }                        
        }
      }
    }
  }
}
  3. Create an index template and the index itself on the leader cluster.
### On the leader cluster
PUT _index_template/timeseries_template
{
  "index_patterns": ["timeseries-*"],                 
  "template": {
    "settings": {
      "index.lifecycle.name": "timeseries_policy",      
      "index.lifecycle.rollover_alias": "timeseries"
    }
  }
}
### On the leader cluster
PUT timeseries-000001
{
  "aliases": {
    "timeseries": {
      "is_write_index": true
    }
  }
}
### Adjust the default lifecycle poll interval from 10 minutes to 10 seconds so that changes become visible sooner.
### On the leader cluster
PUT _cluster/settings
{
  "transient": {
    "indices.lifecycle.poll_interval": "10s"
  }
}
  4. Wait 10+ minutes. You should see indices created and allocated to the different tiers (hot on instance-0, cold on instance-1, frozen on instance-2):
GET _cat/shards/timeseries*?s=index

partial-restored-timeseries-000001 0 p STARTED  0     0b 10.41.0.31  instance-0000000002
partial-restored-timeseries-000002 0 p STARTED  0     0b 10.41.0.31  instance-0000000002
partial-restored-timeseries-000003 0 p STARTED  0     0b 10.41.0.31  instance-0000000002
restored-timeseries-000004         0 p STARTED  0   225b 10.41.0.20  instance-0000000001
restored-timeseries-000005         0 p STARTED  0   225b 10.41.0.20  instance-0000000001
restored-timeseries-000006         0 p STARTED  0   225b 10.41.0.20  instance-0000000001
restored-timeseries-000007         0 p STARTED  1  3.7kb 10.41.0.20  instance-0000000001
timeseries-000008                  0 p STARTED 11 33.9kb 10.41.0.220 instance-0000000000
timeseries-000009                  0 p STARTED  0   225b 10.41.0.220 instance-0000000000
timeseries-000010                  0 p STARTED  0   225b 10.41.0.220 instance-0000000000
timeseries-000011                  0 p STARTED  0   225b 10.41.0.220 instance-0000000000
timeseries-000012                  0 p STARTED  0   225b 10.41.0.220 instance-0000000000
  5. Now go to the follower cluster and set up replication, starting with the remote cluster connection.
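### On the follower cluster
### "ccr1" is the alias used for the leader cluster; server_name and proxy_address are the leader cluster endpoint.
### The "persistent" wrapper is an assumption: the cluster settings API expects settings under "persistent" or "transient".
PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "ccr1": {
          "mode": "proxy",
          "skip_unavailable": "false",
          "server_name": "xxx.us-central1.gcp.foundit.no",
          "proxy_socket_connections": "18",
          "proxy_address": "xxx.us-central1.gcp.foundit.no:9400"
        }
      }
    }
  }
}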

Expected behaviour (when using _ccr/follow)

When replicating a cold (restored-xxx) or frozen (partial-restored-xxx) searchable snapshot index via _ccr/follow, we get the following error, which is the expected and correct behaviour.

### On the follower cluster
PUT my_cold_index/_ccr/follow
{
  "remote_cluster" : "ccr1",
  "leader_index":"restored-timeseries-000006"
}
----
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "leader index [restored-timeseries-000006] is a searchable snapshot index and cannot be used as a leader index for cross-cluster replication purpose"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "leader index [restored-timeseries-000006] is a searchable snapshot index and cannot be used as a leader index for cross-cluster replication purpose"
  },
  "status": 400
}

Unexpected behaviour (when using _ccr/auto_follow)

However, when I use _ccr/auto_follow, I do not get any error about auto-following a searchable snapshot index.

### On the follower cluster
PUT /_ccr/auto_follow/timeseries_pattern
{
  "remote_cluster" : "ccr1",
  "leader_index_patterns" :
  [
    "restored-timeseries*"
  ],
  "follow_index_pattern" : "{{leader_index}}-copy" 
}
----
{
  "acknowledged": true <-- I don't want this; I want to see an error here.
}

I can find out that the auto_follow pattern above is failing because these are searchable snapshot indices, but only by inspecting the errors via GET _ccr/stats:

GET _ccr/stats
----
{
  "auto_follow_stats": {
    "number_of_failed_follow_indices": 12,
    "number_of_failed_remote_cluster_state_requests": 0,
    "number_of_successful_follow_indices": 0,
    "recent_auto_follow_errors": [
      {
        "leader_index": "timeseries_pattern:restored-timeseries-000006",
        "timestamp": 1654573096540,
        "auto_follow_exception": {
          "type": "exception",
          "reason": "index to follow [restored-timeseries-000006] is a searchable snapshot index and cannot be used for cross-cluster replication purpose"
        }
      },
...      
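
Until such failures are reported directly, a client can flatten the stats response into something easier to scan. The following is a minimal sketch, assuming the response body has the shape shown above. The function and class names (auto_follow_failures, AutoFollowFailure) are hypothetical, and the sample dict is trimmed from the output above. It also assumes the "leader_index" field has the form "<pattern name>:<index name>", as in that output.

```python
from typing import Dict, List, NamedTuple


class AutoFollowFailure(NamedTuple):
    pattern: str       # auto-follow pattern name, e.g. "timeseries_pattern"
    leader_index: str  # leader index that could not be followed
    reason: str        # message taken from auto_follow_exception.reason


def auto_follow_failures(ccr_stats: Dict) -> List[AutoFollowFailure]:
    """Flatten recent_auto_follow_errors from a GET _ccr/stats response body."""
    stats = ccr_stats.get("auto_follow_stats", {})
    failures = []
    for entry in stats.get("recent_auto_follow_errors", []):
        # Assumption: "leader_index" looks like "<pattern name>:<index name>".
        pattern, _, index = entry.get("leader_index", "").partition(":")
        reason = entry.get("auto_follow_exception", {}).get("reason", "")
        failures.append(AutoFollowFailure(pattern, index, reason))
    return failures


# Sample trimmed from the GET _ccr/stats response shown above.
sample = {
    "auto_follow_stats": {
        "number_of_failed_follow_indices": 12,
        "number_of_successful_follow_indices": 0,
        "recent_auto_follow_errors": [
            {
                "leader_index": "timeseries_pattern:restored-timeseries-000006",
                "timestamp": 1654573096540,
                "auto_follow_exception": {
                    "type": "exception",
                    "reason": "index to follow [restored-timeseries-000006] is a "
                              "searchable snapshot index and cannot be used for "
                              "cross-cluster replication purpose",
                },
            }
        ],
    }
}

for failure in auto_follow_failures(sample):
    print(f"{failure.pattern} -> {failure.leader_index}: {failure.reason}")
```

This only reads a response that has already been fetched, so it does not change the fact that the PUT request itself is acknowledged without an error.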

@Leaf-Lin Leaf-Lin removed the good first issue low hanging fruit label Jun 15, 2022
@tlrx
Member

tlrx commented Aug 9, 2022

I think it works as expected: the auto-follow pattern is not validated against the leader cluster metadata at creation time, so it is not possible to return an immediate error. Auto-following runs as a separate process that picks up new indices to follow, so it does not behave like a synchronous API such as Put Follow, where you can expect an immediate success or error response.

We could introduce some kind of validation against the remote cluster metadata when an auto-follow pattern is created, but that seems contrary to the asynchronous, separate nature of auto-following. Even if the validation passes at creation time, the user will still have to look at CCR stats to investigate any later issues in auto-following.

I wonder if we should instead make CCR auto-following failures more easily discoverable (through Cluster Health API?)
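
To make the trade-off concrete, here is a rough sketch of what a best-effort, point-in-time check could look like on the client side. It is not how Elasticsearch validates anything. The function names are hypothetical, the matcher treats '*' as the only wildcard, and it assumes that searchable snapshot indices can be recognised by the index setting index.store.type being "snapshot".

```python
import re
from typing import Dict, Iterable, List


def simple_match(pattern: str, name: str) -> bool:
    """Match name against a pattern in which '*' is the only wildcard."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, name) is not None


def unfollowable_matches(
    leader_index_patterns: Iterable[str],
    leader_index_exclusion_patterns: Iterable[str],
    leader_settings: Dict[str, Dict[str, str]],
) -> List[str]:
    """Return leader indices the pattern selects but that are searchable snapshots.

    leader_settings maps index name -> flat index settings, as a caller might
    build from the leader cluster's index settings at the time of the check.
    """
    includes = list(leader_index_patterns)
    excludes = list(leader_index_exclusion_patterns)
    flagged = []
    for name, settings in sorted(leader_settings.items()):
        if not any(simple_match(p, name) for p in includes):
            continue
        if any(simple_match(p, name) for p in excludes):
            continue
        # Assumption: searchable snapshot indices carry index.store.type = "snapshot".
        if settings.get("index.store.type") == "snapshot":
            flagged.append(name)
    return flagged


# Hypothetical snapshot of leader index settings, using index names from the report.
leader = {
    "partial-restored-timeseries-000001": {"index.store.type": "snapshot"},
    "restored-timeseries-000006": {"index.store.type": "snapshot"},
    "timeseries-000008": {},
}

print(unfollowable_matches(["restored-timeseries*"], [], leader))
# ['restored-timeseries-000006']
```

Such a check only describes the leader cluster at the moment it runs. Indices that ILM converts to searchable snapshots afterwards would still fail asynchronously, which is why the stats remain the place to look.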
