When CCR disallow auto follow leader system indices should show error #81238

Leaf-Lin · 2021-12-02T06:35:21Z

Elasticsearch version (bin/elasticsearch --version): 8.0.0-beta1

Plugins installed: []

JVM version (java -version):

OS version (uname -a if on a Unix-like system):

Description of the problem including expected versus actual behavior:
#72815 disallows CCR from auto-following leader system indices.
This is fine, except that one can still use the API or the Kibana UI to create an auto-follow pattern that contains system indices.

Internally, Elasticsearch prevents system indices from being replicated, but neither the API nor the UI returns an error message, so the user cannot figure out why the system indices were not replicated.

Steps to reproduce:

Create two deployments, clusterA and clusterB, both on 8.0.0.
Set up CCR so that clusterA can access remote clusterB.
On clusterA, delete .kibana_8.0.0_001.
Create an auto_follow pattern on clusterA that specifies leader system indices from clusterB:

PUT /_ccr/auto_follow/kibana
{
  "remote_cluster": "clusterB",
  "leader_index_patterns": [
    ".kibana_8.0.0_001"
  ],
  "leader_index_exclusion_patterns": [],
  "follow_index_pattern": "{{leader_index}}"
}

--> Expected behavior: the request fails with an error message saying that .kibana_8.0.0_001 is a system index.
--> Actual behavior: it returns the following, as if the pattern had been accepted.

{
  "acknowledged" : true
}
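
Until the API rejects such patterns, a client-side check before sending the request can at least flag suspicious ones. The following is a minimal Python sketch, not an Elasticsearch feature: `flag_dot_prefixed_patterns` is a hypothetical helper, and it relies on the assumption that system indices conventionally start with a dot. Not every dot-prefixed index is a system index, so treat the result as a hint only.

```python
def flag_dot_prefixed_patterns(leader_index_patterns):
    """Return the patterns that start with a dot.

    Heuristic only (assumption): system indices conventionally have
    dot-prefixed names, but a dot prefix alone does not prove that an
    index is a system index.
    """
    return [p for p in leader_index_patterns if p.startswith(".")]


# Request body from the reproduction steps above.
body = {
    "remote_cluster": "clusterB",
    "leader_index_patterns": [".kibana_8.0.0_001"],
    "leader_index_exclusion_patterns": [],
    "follow_index_pattern": "{{leader_index}}",
}

suspicious = flag_dot_prefixed_patterns(body["leader_index_patterns"])
if suspicious:
    print("Patterns that may target system indices:", suspicious)
```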


elasticmachine · 2021-12-02T06:35:23Z

Pinging @elastic/es-distributed (Team:Distributed)

Leaf-Lin · 2022-06-07T04:19:29Z

I just encountered the same issue for a different category of indices: CCR failed to report an error when I set up auto_follow for searchable snapshot indices.

Prerequisite

First, set up two clusters. On the leader cluster, configure hot, cold and frozen tiers. On the follower cluster, configure a hot tier.
Set up an ILM policy on the leader cluster so that the index transitions from hot to cold to frozen.

### On the leader cluster
PUT _ilm/policy/timeseries_policy
{
  "policy": {
    "phases": {
      "hot": {                                
        "actions": {
          "rollover": {
            "max_age": "1m"
          }
        }
      },
      "cold": {
        "min_age": "5m",         
        "actions": {
          "searchable_snapshot" : {
            "snapshot_repository" : "found-snapshots"
          }                        
        }
      },
      "frozen": {
        "min_age": "10m",         
        "actions": {
          "searchable_snapshot" : {
            "snapshot_repository" : "found-snapshots"
          }                        
        }
      }
    }
  }
}

Create an index template and the index itself on the leader cluster

### On the leader cluster
PUT _index_template/timeseries_template
{
  "index_patterns": ["timeseries-*"],                 
  "template": {
    "settings": {
      "index.lifecycle.name": "timeseries_policy",      
      "index.lifecycle.rollover_alias": "timeseries"
    }
  }
}
### On the leader cluster
PUT timeseries-000001
{
  "aliases": {
    "timeseries": {
      "is_write_index": true
    }
  }
}
### Lower the lifecycle poll interval from the default 10 minutes to 10 seconds so that changes are visible sooner.
### On the leader cluster
PUT _cluster/settings
{
  "transient": {
    "indices.lifecycle.poll_interval": "10s"
  }
}

Wait 10+ minutes. You should then see indices created and allocated to the different tiers (hot on instance-0, cold on instance-1, frozen on instance-2):

GET _cat/shards/timeseries*?s=index

partial-restored-timeseries-000001 0 p STARTED  0     0b 10.41.0.31  instance-0000000002
partial-restored-timeseries-000002 0 p STARTED  0     0b 10.41.0.31  instance-0000000002
partial-restored-timeseries-000003 0 p STARTED  0     0b 10.41.0.31  instance-0000000002
restored-timeseries-000004         0 p STARTED  0   225b 10.41.0.20  instance-0000000001
restored-timeseries-000005         0 p STARTED  0   225b 10.41.0.20  instance-0000000001
restored-timeseries-000006         0 p STARTED  0   225b 10.41.0.20  instance-0000000001
restored-timeseries-000007         0 p STARTED  1  3.7kb 10.41.0.20  instance-0000000001
timeseries-000008                  0 p STARTED 11 33.9kb 10.41.0.220 instance-0000000000
timeseries-000009                  0 p STARTED  0   225b 10.41.0.220 instance-0000000000
timeseries-000010                  0 p STARTED  0   225b 10.41.0.220 instance-0000000000
timeseries-000011                  0 p STARTED  0   225b 10.41.0.220 instance-0000000000
timeseries-000012                  0 p STARTED  0   225b 10.41.0.220 instance-0000000000
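
To check a listing like this without reading it line by line, the shards can be counted per tier from the index name alone. This is a rough Python sketch that assumes the naming seen in the output above (`partial-` for frozen, `restored-` for cold, anything else hot); `tier_from_index_name` and `count_shards_by_tier` are hypothetical helpers, and a custom prefix would break the mapping.

```python
def tier_from_index_name(index_name):
    """Guess the data tier from the index name prefix (assumption, see above)."""
    if index_name.startswith("partial-"):
        return "frozen"
    if index_name.startswith("restored-"):
        return "cold"
    return "hot"


def count_shards_by_tier(cat_shards_text):
    """Count shard rows per guessed tier in _cat/shards text output."""
    counts = {}
    for line in cat_shards_text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        tier = tier_from_index_name(fields[0])  # first column is the index name
        counts[tier] = counts.get(tier, 0) + 1
    return counts


# Three rows copied from the listing above.
sample = """\
partial-restored-timeseries-000001 0 p STARTED  0     0b 10.41.0.31  instance-0000000002
restored-timeseries-000004         0 p STARTED  0   225b 10.41.0.20  instance-0000000001
timeseries-000008                  0 p STARTED 11 33.9kb 10.41.0.220 instance-0000000000
"""
print(count_shards_by_tier(sample))
```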

Now we are ready to go to the follower cluster and set up replication.

### On the follower cluster
### "ccr1" is the alias used for the leader cluster; server_name and proxy_address point at the leader cluster endpoint.
PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "ccr1": {
          "mode": "proxy",
          "skip_unavailable": "false",
          "server_name": "xxx.us-central1.gcp.foundit.no",
          "proxy_socket_connections": "18",
          "proxy_address": "xxx.us-central1.gcp.foundit.no:9400"
        }
      }
    }
  }
}

Expected behaviour (When using ccr/follow)

We noticed that when replicating a cold restored-xxx or frozen partial-restored searchable snapshot index via _ccr/follow, we get the following message which is the expected correct behaviour.

### On the follower cluster
PUT my_cold_index/_ccr/follow
{
  "remote_cluster" : "ccr1",
  "leader_index":"restored-timeseries-000006"
}
----
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "leader index [restored-timeseries-000006] is a searchable snapshot index and cannot be used as a leader index for cross-cluster replication purpose"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "leader index [restored-timeseries-000006] is a searchable snapshot index and cannot be used as a leader index for cross-cluster replication purpose"
  },
  "status": 400
}

Unexpected behaviour (when using ccr/auto_follow)

However, when I use _ccr/auto_follow, I do not get any error for auto-following a searchable snapshot index.

### On the follower cluster
PUT /_ccr/auto_follow/timeseries_pattern
{
  "remote_cluster" : "ccr1",
  "leader_index_patterns" :
  [
    "restored-timeseries*"
  ],
  "follow_index_pattern" : "{{leader_index}}-copy" 
}
----
{
  "acknowledged": true <-- I don't want this, I want to see error here.
}

I can find out that the above auto_follow is failing because these are searchable snapshot indices, but only by inspecting the errors via GET _ccr/stats:

GET _ccr/stats
----
{
  "auto_follow_stats": {
    "number_of_failed_follow_indices": 12,
    "number_of_failed_remote_cluster_state_requests": 0,
    "number_of_successful_follow_indices": 0,
    "recent_auto_follow_errors": [
      {
        "leader_index": "timeseries_pattern:restored-timeseries-000006",
        "timestamp": 1654573096540,
        "auto_follow_exception": {
          "type": "exception",
          "reason": "index to follow [restored-timeseries-000006] is a searchable snapshot index and cannot be used for cross-cluster replication purpose"
        }
      },
...
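
Since the stats response is currently the only place these failures surface, a small script can condense it. Below is a minimal Python sketch that groups the recent auto-follow errors by reason. It uses only the field names visible in the response above; `summarize_auto_follow_errors` is a hypothetical helper, and fetching the JSON from the cluster is left out.

```python
from collections import Counter


def summarize_auto_follow_errors(ccr_stats):
    """Group recent auto-follow errors by reason and report the counters."""
    auto = ccr_stats.get("auto_follow_stats", {})
    reasons = Counter(
        err.get("auto_follow_exception", {}).get("reason", "unknown")
        for err in auto.get("recent_auto_follow_errors", [])
    )
    return {
        "failed": auto.get("number_of_failed_follow_indices", 0),
        "successful": auto.get("number_of_successful_follow_indices", 0),
        "reasons": dict(reasons),
    }


# Trimmed copy of the GET _ccr/stats response shown above.
sample = {
    "auto_follow_stats": {
        "number_of_failed_follow_indices": 12,
        "number_of_failed_remote_cluster_state_requests": 0,
        "number_of_successful_follow_indices": 0,
        "recent_auto_follow_errors": [
            {
                "leader_index": "timeseries_pattern:restored-timeseries-000006",
                "timestamp": 1654573096540,
                "auto_follow_exception": {
                    "type": "exception",
                    "reason": "index to follow [restored-timeseries-000006] is a searchable snapshot index and cannot be used for cross-cluster replication purpose",
                },
            }
        ],
    }
}

summary = summarize_auto_follow_errors(sample)
print(summary["failed"], "failed follow attempts")
for reason, count in summary["reasons"].items():
    print(count, "x", reason)
```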

tlrx · 2022-08-09T14:55:56Z

I think it works as expected: the auto-follow pattern is not validated using the leader cluster metadata at creation time and as such it is not possible to return an immediate error. The auto-follow API works as a separate process that picks up new indices to follow and as such does not work as a synchronous API like Put Follow where you can expect an immediate success/error response.

We could introduce some kind of validation with the remote cluster metadata when an auto-follow pattern is created but it looks contrary to the asynchronous, separate nature of auto-following. The validation could pass at creation time but the user will have to look at CCR stats to investigate any issues in auto-following anyway.

I wonder if we should instead make CCR auto-following failures more easily discoverable (through Cluster Health API?)
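
For illustration, the creation-time validation discussed above could look roughly like the following Python sketch. It is not how Elasticsearch implements anything: `find_unfollowable_leaders` is a hypothetical helper, `fnmatchcase` only approximates Elasticsearch wildcard matching, and the check assumes that a searchable snapshot index is recognisable by `index.store.type` being `snapshot` in its settings. As noted, a pattern that passes such a check at creation time can still fail later when new leader indices appear.

```python
from fnmatch import fnmatchcase


def find_unfollowable_leaders(patterns, leader_settings):
    """Return leader indices that match a pattern and are searchable snapshots.

    leader_settings mirrors the shape of a GET <index>/_settings response:
    {index_name: {"settings": {"index": {...}}}}.
    """
    blocked = []
    for name, entry in sorted(leader_settings.items()):
        if not any(fnmatchcase(name, p) for p in patterns):
            continue  # not covered by the auto-follow pattern
        index_settings = entry.get("settings", {}).get("index", {})
        # Assumption: searchable snapshot indices carry index.store.type = snapshot.
        if index_settings.get("store", {}).get("type") == "snapshot":
            blocked.append(name)
    return blocked


# Hand-written sample settings for three of the leader indices listed earlier.
leader_settings = {
    "partial-restored-timeseries-000001": {
        "settings": {"index": {"store": {"type": "snapshot"}}}
    },
    "restored-timeseries-000006": {
        "settings": {"index": {"store": {"type": "snapshot"}}}
    },
    "timeseries-000008": {"settings": {"index": {}}},
}

print(find_unfollowable_leaders(["restored-timeseries*"], leader_settings))
```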

Leaf-Lin added the :Distributed/CCR, Team:Distributed and needs:triage labels on Dec 2, 2021

Leaf-Lin added the >bug label and removed the needs:triage label on Dec 2, 2021

Leaf-Lin mentioned this issue on Mar 31, 2022: [Docs] Step-by-step tutorial for uni-directional CCR failover #84854 (Closed)

Leaf-Lin added the good first issue label on May 12, 2022

Leaf-Lin mentioned this issue on Jun 7, 2022: [DOCS] Add CCR limitation #87348 (Merged)

Leaf-Lin removed the good first issue label on Jun 15, 2022

Leaf-Lin mentioned this issue on Sep 23, 2022: CCR auto_follow to follow both newly created and existing indices #90281 (Open)
