Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

do not merge me #2

Closed
wants to merge 20 commits into from
Closed

do not merge me #2

wants to merge 20 commits into from

Conversation

nik9000
Copy link
Owner

@nik9000 nik9000 commented Aug 10, 2016

The standard way to change an index's mapping is to create a new index with the
new mapping, _reindex the documents into the new index, flip the alias from
the old index to the new index, and then remove the old index. Traditionally
this sort of thing has been left as an exercise for those implementing an
application against Elasticsearch but I think now is the time to implement this
in Elasticsearch because:

  1. Watcher and Security need to run this process as part of upgrading to 5.0.
  2. Elasticsearch 5.0 now has the .tasks index for storing the results of
    tasks long running. While we were fairly careful in designing its mappings,
    I'm under no illusion that we got it right the first try. That just isn't the
    way software works. We're going to want to run this on .tasks one day.
  3. Logstash is considering storing configuration in an Elasticsearch index and
    handling upgrades to the format of the data is a concern for Logstash's
    engineers.

In all of these cases the indexes are implementation details of their
application so we'd like to automatically upgrade them on startup rather than
provide upgrade scripts. That means that the application will want to migrate
its data every time it starts up so a user only has to get involved if the data
migration fails.

3 of the 4 applications that will need to do this migration live inside
Elasticsearch (Watcher and Security are a plugin, .tasks is in core
Elasticsearch). So it looks like the right place to implement this is in core
Elasticsearch. The other advantage of implementing it there is that it can be
used by the widest range of users.

This PR intends to build an action into core Elasticsearch that:

  1. Responds quickly with 200 OK when the index is in the desired state
    already.
  2. Waits on concurrent invocations of the same request. This is especially
    important in "masterless" systems like Logstash so they can invoke this API on
    startup and not have to worry about one node "winning". They all get the same
    response.
  3. Notices if previous executions of this request didn't complete properly and
    responds with that information rather than some cryptic failure message.
  4. Performs the create index, migrate documents, flip alias, delete source
    index steps.

It exposes it with an HTTP request that looks like:

POST /index_1/_migrate/index_2
{
  "settings": {...},
  "mapping": {...},
  "aliases": ["index"],
  "script": {
    "lang": "painless",
    "inline": "ctx._source.thing = 2"
  }
}

In this example index_1 is the source index and index_2 is the destination
index. Unlike a normal create index command the aliases section is required.
This is how _migrate knows that the process is complete and it is a good
practice anyway. The alias is added to the destination index after all the docs
in the source index are migrated to the destination index and the destination
index has been _refreshed so they are visible.

Like _reindex and _delete_by_query and _update_by_query, these requests
are "big" in that they do many things and we expect them to take a long time if
they operate on a large number of documents. This can't be helped so we want to
make sure that this request integrates well with the task management API. That
means that it should be "cancellable": true and it's status should be super
expressive, returning the phase of the operation currently being performed and
if that phase is reindex then it needs to return the details of the reindex's
status.

We try to limit the number of "big" operations in core Elasticsearch because
every one of them feels like a new trap we are setting for unsuspecting users.
We will need to warn users that this can take some time and put some load on
the cluster. For the users all the way at the top of the document we don't
expect this to be a problem though. A Security index with a million documents
is huge but not a ton of work for reindex. We just have to make very very
sure that it is obvious to users that doing this against an index with a
hundred million documents is going to take a long time.

@nik9000 nik9000 changed the title tmp do not merge me Aug 10, 2016
@@ -40,7 +41,8 @@
return Arrays.asList(new ActionHandler<>(ReindexAction.INSTANCE, TransportReindexAction.class),
new ActionHandler<>(UpdateByQueryAction.INSTANCE, TransportUpdateByQueryAction.class),
new ActionHandler<>(DeleteByQueryAction.INSTANCE, TransportDeleteByQueryAction.class),
new ActionHandler<>(RethrottleAction.INSTANCE, TransportRethrottleAction.class));
new ActionHandler<>(RethrottleAction.INSTANCE, TransportRethrottleAction.class),
new ActionHandler<>(MigrateIndexAction.INSTANCE, TransportMigrateIndexAction.class));
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping this transport action in the reindex module while the request and response are in core sucks. We're not really sure what the right way to fix it is because we want other Elasticsearch plugins to be able to use this request and if it is in the reindex plugin this is hard. We want it in the reindex module because it relies on reindex and we don't want to yank reindex into core.

Parsing a search request is currently split up among a number of
classes, using multiple public static methods, which take multiple
regstries of elements that may appear in the search request like query
parsers and aggregations. This change begins consolidating all this code
by collapsing the registries normally used for parsing search requests
into a single SearchRequestParsers class. It is also made available to
plugin services to enable templating of search requests.  Eventually all
of the actual parsing logic should move to the class, and the registries
should be hidden, but for now they are at least co-located to reduce the
number of objects that must be passed around.
@nik9000 nik9000 force-pushed the index_migrate branch 2 times, most recently from e2854c3 to f507ca7 Compare August 16, 2016 17:10
@nik9000 nik9000 closed this Aug 17, 2016
@nik9000
Copy link
Owner Author

nik9000 commented Aug 18, 2016

I've moved this to elastic#20024

nik9000 pushed a commit that referenced this pull request Jul 26, 2017
…point into lucene (elastic#25827)

When a replica processes out of order operations, it can drop some due to version comparisons. In the past that would have resulted in a VersionConflictException being thrown and the operation was totally ignored. With the seq# push, we started storing these operations in the translog (but not indexing them into lucene) in order to have complete op histories to facilitate ops based recoveries. This in turn had the undesired effect that deleted docs may be resurrected during recovery in some extreme edge situation (see a complete explanation below). This PR contains a simple fix, which is also an optimization for the recovery process, incoming operation that have a seq# lower than the current local checkpoint (i.e., have already been processed) should not be indexed into lucene. Note that sometimes we can also skip storing them in the translog, but this is not required for the fix and is more complicated.

This is the equivalent of elastic#25592

## More details on resurrected ops 

Consider two operations: 
 - Index d1, seq no 1
 - Delete d1, seq no 3

On a replica they come out of order:
 - Translog gen 1 contains:
    - delete (seqNo 3)
 - Translog gen 2 contains:
    - index (seqNo 1) (wasn't indexed into lucene, but put into the translog)
    - another operation (seqNo 10)
 - Translog gen 3 
    - another op (seqNo 9)
 - Engine commits with:
    - local checkpoint 9
    - refers to gen 2 

If this replica becomes a primary:
    - Local recovery will replay translog gen 2 and up, causing index #1 to be re-index. 
    - Even if recovery will start at gen 3, the translog retention policy will cause file based recovery to replay the entire translog. If it happens to start at gen 2 (but not 1), we will run into the same problem.

#### Some context - out of order delivery involving deletes:

On normal operations, this relies on the gc_deletes setting. We assume that the setting represents an upper bound on the time between the index and the delete operation. The index operation will be detected as stale based on the tombstone map in the LiveVersionMap.

Recovery presents a challenge as it can replay an old index operation that was in the translog and override a delete operation that was done when the engine was opened (and is not part of the replayed snapshot). To deal with this situation, we disable GC deletes (i.e. retain all deletes) for the duration of recoveries. This means that the delete operation will be remembered and the index operation ignored.

Both of the above scenarios (local recover + peer recovery) create a situation where the delete operation is never replayed. It this "lost" as lucene doesn't remember it happened and our LiveVersionMap is populated with it.

#### Solution:

Note that both local and peer recovery represent a scenario where we replay translog ops on top of an existing lucene index, potentially with ongoing indexing. Therefore we can treat them the same.

The local checkpoint in Lucene represent a marker indicating that all operations below it were performed on the index. This is the only form of "memory" that we have that relates to deletes. If we can achieve the following:
1) All ops below the local checkpoint are not indexed to lucene.
2) All ops above the local checkpoint are

It will mean that all  variants are covered: (i# == index op seq#, d# == delete op seq#, lc == local checkpoint in commit)
1) i# < d# <= lc - document is already deleted in lucene and stays that way.
2) i# <= lc < d# - delete is replayed on index - document is deleted
3) lc < i# < d# - index is replayed and then delete - document is deleted.

More formally - we want to make sure that for all ops that performed on the primary o1 and o2, if o2 is processed on a shard before o1, o1 will be dropped. We have the following scenarios

1) If both o1 or o2 are not included in the replayed snapshot and are above it (i.e., have a higher seq#), they fall under the gc deletes assumption.
2) If both o1 is part of the replayed snapshot but o2 is above it:
	- if o2 arrives first, o1 must arrive due to the recovery and potentially via replication as well. since gc deletes is disabled we are guaranteed to know of o2's existence.
3) If both o2 and o1 are part of the replayed snapshot:
	- we fall under the same scenarios as #2 - disabling GC deletes ensures we know of o2 if it arrives first.
4) If o1 falls before the snapshot and o2 is either part of the snapshot or higher:
	- Since the snapshot is guaranteed to contain all ops that are not part of lucene and are above the lc in the commit used, this means that o1 is part of lucene and o1 < local checkpoint. This means it won't be processed and we're not in the scenario we're discussing.
5) If o2 falls before the snapshot but o1 is part of it:
	- by the same reasoning above, o2 is < local checkpoint. Since o1 < o2, we also get o1 < local checkpoint and this will be dropped.


#### Implementation:

For local recovery, we can filter the ops we read of the translog and avoid replaying them. For peer recovery this is tricky as we do want to send the operations in order to have some history on the target shard. Filtering operations on the engine level (i.e., not indexing to lucene if op seq# <= lc) would work for both.
nik9000 pushed a commit that referenced this pull request Mar 22, 2018
In elastic#28350, we fixed an endless flushing loop which may happen on 
replicas by tightening the relation between the flush action and the
periodically flush condition.

1. The periodically flush condition is enabled only if it is disabled 
after a flush.

2. If the periodically flush condition is enabled then a flush will
actually happen regardless of Lucene state.

(1) and (2) guarantee that a flushing loop will be terminated. Sadly, 
the condition 1 can be violated in edge cases as we used two different
algorithms to evaluate the current and future uncommitted translog size.

- We use method `uncommittedSizeInBytes` to calculate current 
  uncommitted size. It is the sum of translogs whose generation at least
the minGen (determined by a given seqno). We pick a continuous range of
translogs since the minGen to evaluate the current uncommitted size.

- We use method `sizeOfGensAboveSeqNoInBytes` to calculate the future 
  uncommitted size. It is the sum of translogs whose maxSeqNo at least
the given seqNo. Here we don't pick a range but select translog one by
one.

Suppose we have 3 translogs `gen1={#1,#2}, gen2={}, gen3={#3} and 
seqno=#1`, `uncommittedSizeInBytes` is the sum of gen1, gen2, and gen3
while `sizeOfGensAboveSeqNoInBytes` is the sum of gen1 and gen3. Gen2 is
excluded because its maxSeqno is still -1.

This commit removes both `sizeOfGensAboveSeqNoInBytes` and 
`uncommittedSizeInBytes` methods, then enforces an engine to use only
`sizeInBytesByMinGen` method to evaluate the periodically flush condition.

Closes elastic#29097
Relates #elastic#28350
nik9000 pushed a commit that referenced this pull request Feb 4, 2019
In elastic#28350, we fixed an endless flushing loop which may happen on
replicas by tightening the relation between the flush action and the
periodically flush condition.

1. The periodically flush condition is enabled only if it is disabled
after a flush.

2. If the periodically flush condition is enabled then a flush will
actually happen regardless of Lucene state.

(1) and (2) guarantee that a flushing loop will be terminated. Sadly,
the condition 1 can be violated in edge cases as we used two different
algorithms to evaluate the current and future uncommitted translog size.

- We use method `uncommittedSizeInBytes` to calculate current
  uncommitted size. It is the sum of translogs whose generation at least
the minGen (determined by a given seqno). We pick a continuous range of
translogs since the minGen to evaluate the current uncommitted size.

- We use method `sizeOfGensAboveSeqNoInBytes` to calculate the future
  uncommitted size. It is the sum of translogs whose maxSeqNo at least
the given seqNo. Here we don't pick a range but select translog one by
one.

Suppose we have 3 translogs `gen1={#1,#2}, gen2={}, gen3={#3} and
seqno=#1`, `uncommittedSizeInBytes` is the sum of gen1, gen2, and gen3
while `sizeOfGensAboveSeqNoInBytes` is the sum of gen1 and gen3. Gen2 is
excluded because its maxSeqno is still -1.

This commit removes both `sizeOfGensAboveSeqNoInBytes` and
`uncommittedSizeInBytes` methods, then enforces an engine to use only
`sizeInBytesByMinGen` method to evaluate the periodically flush condition.

Closes elastic#29097
Relates #elastic#28350
nik9000 pushed a commit that referenced this pull request May 13, 2020
* * StartsWith is case sensitive aware
* Added case sensitivity to EQL configuration
* case_sensitive parameter can be specified when running queries (default
is case insensitive)
* Added STARTS_WITH function to SQL as well

* Add case sensitive aware queryfolder tests

* Address reviews

* Address review #2
nik9000 pushed a commit that referenced this pull request Mar 11, 2021
…astic#69765)

Previously we did not resolve the attributes recursively which meant that if a field or expression was re-aliased multiple times (through multiple levels of subqueries), the aliases were only resolved one level down. This led to failed query translation because `ReferenceAttribute`s were pointing to non-existing attributes during query translation.

For example the query

```sql
SELECT i AS j FROM ( SELECT int AS i FROM test) ORDER BY j
```

failed during translation because the `OrderBy` resolved the `j` ReferenceAttribute to another `i` ReferenceAttribute that was later removed by an Optimization:

```
OrderBy[[Order[j{r}#4,ASC,LAST]]]                                             ! OrderBy[[Order[i{r}#2,ASC,LAST]]]
\_Project[[j]]                                                                = \_Project[[j]]
  \_Project[[i]]                                                              !   \_EsRelation[test][date{f}#6, some{f}#7, some.string{f}#8, some.string..]
    \_EsRelation[test][date{f}#6, some{f}#7, some.string{f}#8, some.string..] ! 
```

By resolving the `Attributes` recursively both `j{r}` and `i{r}` will resolve to `test.int{f}` above:

```
OrderBy[[Order[test.int{f}elastic#22,ASC,LAST]]]                                     = OrderBy[[Order[test.int{f}elastic#22,ASC,LAST]]]
\_Project[[j]]                                                                = \_Project[[j]]
  \_Project[[i]]                                                              !   \_EsRelation[test][date{f}#6, some{f}#7, some.string{f}#8, some.string..]
    \_EsRelation[test][date{f}#6, some{f}#7, some.string{f}#8, some.string..] ! 
 ```

The scope of recursive resolution depends on how the `AttributeMap` is constructed and populated.

Fixes elastic#67237
nik9000 pushed a commit that referenced this pull request Mar 15, 2021
…astic#69765) (elastic#70322)

Previously we did not resolve the attributes recursively which meant that if a field or expression was re-aliased multiple times (through multiple levels of subqueries), the aliases were only resolved one level down. This led to failed query translation because `ReferenceAttribute`s were pointing to non-existing attributes during query translation.

For example the query

```sql
SELECT i AS j FROM ( SELECT int AS i FROM test) ORDER BY j
```

failed during translation because the `OrderBy` resolved the `j` ReferenceAttribute to another `i` ReferenceAttribute that was later removed by an Optimization:

```
OrderBy[[Order[j{r}#4,ASC,LAST]]]                                             ! OrderBy[[Order[i{r}#2,ASC,LAST]]]
\_Project[[j]]                                                                = \_Project[[j]]
  \_Project[[i]]                                                              !   \_EsRelation[test][date{f}#6, some{f}#7, some.string{f}#8, some.string..]
    \_EsRelation[test][date{f}#6, some{f}#7, some.string{f}#8, some.string..] ! 
```

By resolving the `Attributes` recursively both `j{r}` and `i{r}` will resolve to `test.int{f}` above:

```
OrderBy[[Order[test.int{f}elastic#22,ASC,LAST]]]                                     = OrderBy[[Order[test.int{f}elastic#22,ASC,LAST]]]
\_Project[[j]]                                                                = \_Project[[j]]
  \_Project[[i]]                                                              !   \_EsRelation[test][date{f}#6, some{f}#7, some.string{f}#8, some.string..]
    \_EsRelation[test][date{f}#6, some{f}#7, some.string{f}#8, some.string..] ! 
 ```

The scope of recursive resolution depends on how the `AttributeMap` is constructed and populated.

Fixes elastic#67237
nik9000 pushed a commit that referenced this pull request Mar 24, 2021
…astic#69765) (elastic#70325)

Previously we did not resolve the attributes recursively which meant that if a field or expression was re-aliased multiple times (through multiple levels of subqueries), the aliases were only resolved one level down. This led to failed query translation because `ReferenceAttribute`s were pointing to non-existing attributes during query translation.

For example the query

```sql
SELECT i AS j FROM ( SELECT int AS i FROM test) ORDER BY j
```

failed during translation because the `OrderBy` resolved the `j` ReferenceAttribute to another `i` ReferenceAttribute that was later removed by an Optimization:

```
OrderBy[[Order[j{r}#4,ASC,LAST]]]                                             ! OrderBy[[Order[i{r}#2,ASC,LAST]]]
\_Project[[j]]                                                                = \_Project[[j]]
  \_Project[[i]]                                                              !   \_EsRelation[test][date{f}#6, some{f}#7, some.string{f}#8, some.string..]
    \_EsRelation[test][date{f}#6, some{f}#7, some.string{f}#8, some.string..] ! 
```

By resolving the `Attributes` recursively both `j{r}` and `i{r}` will resolve to `test.int{f}` above:

```
OrderBy[[Order[test.int{f}elastic#22,ASC,LAST]]]                                     = OrderBy[[Order[test.int{f}elastic#22,ASC,LAST]]]
\_Project[[j]]                                                                = \_Project[[j]]
  \_Project[[i]]                                                              !   \_EsRelation[test][date{f}#6, some{f}#7, some.string{f}#8, some.string..]
    \_EsRelation[test][date{f}#6, some{f}#7, some.string{f}#8, some.string..] ! 
 ```

The scope of recursive resolution depends on how the `AttributeMap` is constructed and populated.

Fixes elastic#67237
nik9000 pushed a commit that referenced this pull request Oct 12, 2023
…c#100592)

* Don't print synthetic source in mapping for bwc tests

* Move comment.

* Don't print synthetic source in mapping for bwc tests #2

* Don't print synthetic source in mapping for bwc tests #2

* Revert "Don't print synthetic source in mapping for bwc tests #2"

This reverts commit 034262c.

* Revert "Don't print synthetic source in mapping for bwc tests #2"

This reverts commit 44e8156.

* Revert "Don't print synthetic source in mapping for bwc tests (elastic#100572)"

This reverts commit 9322ab9.

* Exclude synthetic source test from mixedClusterTests

* Update comment.
nik9000 pushed a commit that referenced this pull request Oct 23, 2023
….10 (elastic#100805)

* Don't print synthetic source in mapping for bwc tests

* Move comment.

* Don't print synthetic source in mapping for bwc tests #2

* Don't print synthetic source in mapping for bwc tests #2

* Revert "Don't print synthetic source in mapping for bwc tests #2"

This reverts commit 034262c.

* Revert "Don't print synthetic source in mapping for bwc tests #2"

This reverts commit 44e8156.

* Revert "Don't print synthetic source in mapping for bwc tests (elastic#100572)"

This reverts commit 9322ab9.

* Exclude synthetic source test from mixedClusterTests

* Update comment.

* Mute all tsdb tests in mixedClusterTests

This is an interim step to stop sporadic test failures, while we try to
fix version skip for mixed cluster tests.

* Remove old exclusion

* Add aggregations too

* Mute tests for versions between 8.7-8.10

* Remove mute
nik9000 pushed a commit that referenced this pull request Oct 23, 2023
* Don't print synthetic source in mapping for bwc tests

* Move comment.

* Don't print synthetic source in mapping for bwc tests #2

* Don't print synthetic source in mapping for bwc tests #2

* Revert "Don't print synthetic source in mapping for bwc tests #2"

This reverts commit 034262c.

* Revert "Don't print synthetic source in mapping for bwc tests #2"

This reverts commit 44e8156.

* Revert "Don't print synthetic source in mapping for bwc tests (elastic#100572)"

This reverts commit 9322ab9.

* Exclude synthetic source test from mixedClusterTests

* Update comment.

* Mute all tsdb tests in mixedClusterTests

This is an interim step to stop sporadic test failures, while we try to
fix version skip for mixed cluster tests.

* Remove old exclusion

* Add aggregations too

* Mute tests for versions between 8.7-8.10

* Remove mute

* Restore version skipping for position fields

* Restore version skip for synthetic source
nik9000 pushed a commit that referenced this pull request Oct 24, 2023
….10 (elastic#100805) (elastic#100814)

* Don't print synthetic source in mapping for bwc tests

* Move comment.

* Don't print synthetic source in mapping for bwc tests #2

* Don't print synthetic source in mapping for bwc tests #2

* Revert "Don't print synthetic source in mapping for bwc tests #2"

This reverts commit 034262c.

* Revert "Don't print synthetic source in mapping for bwc tests #2"

This reverts commit 44e8156.

* Revert "Don't print synthetic source in mapping for bwc tests (elastic#100572)"

This reverts commit 9322ab9.

* Exclude synthetic source test from mixedClusterTests

* Update comment.

* Mute all tsdb tests in mixedClusterTests

This is an interim step to stop sporadic test failures, while we try to
fix version skip for mixed cluster tests.

* Remove old exclusion

* Add aggregations too

* Mute tests for versions between 8.7-8.10

* Remove mute
nik9000 pushed a commit that referenced this pull request Oct 24, 2023
…c#100823)

* Don't print synthetic source in mapping for bwc tests

* Move comment.

* Don't print synthetic source in mapping for bwc tests #2

* Don't print synthetic source in mapping for bwc tests #2

* Revert "Don't print synthetic source in mapping for bwc tests #2"

This reverts commit 034262c.

* Revert "Don't print synthetic source in mapping for bwc tests #2"

This reverts commit 44e8156.

* Revert "Don't print synthetic source in mapping for bwc tests (elastic#100572)"

This reverts commit 9322ab9.

* Exclude synthetic source test from mixedClusterTests

* Update comment.

* Mute all tsdb tests in mixedClusterTests

This is an interim step to stop sporadic test failures, while we try to
fix version skip for mixed cluster tests.

* Remove old exclusion

* Add aggregations too

* Mute tests for versions between 8.7-8.10

* Remove mute

* Restore version skipping for position fields

* Restore version skip for synthetic source
nik9000 added a commit that referenced this pull request Mar 13, 2024
When we run the csv-spec tests for ESQL against a real http endpoint we
actually run them twice - once async and once sync. But the names of the
tests didn't reflect that - they just looked like they were accidentally
duplicated. This updates the format. So this:
```
test {string.Trim}
test {string.Trim #2}
```

becomes:
```
test {string.Trim ASYNC}
test {string.Trim SYNC}
```
nik9000 added a commit that referenced this pull request Mar 13, 2024
When we run the csv-spec tests for ESQL against a real http endpoint we
actually run them twice - once async and once sync. But the names of the
tests didn't reflect that - they just looked like they were accidentally
duplicated. This updates the format. So this:
```
test {string.Trim}
test {string.Trim #2}
```

becomes:
```
test {string.Trim ASYNC}
test {string.Trim SYNC}
```
nik9000 pushed a commit that referenced this pull request Jul 17, 2024
nik9000 pushed a commit that referenced this pull request Sep 3, 2024
…alCentroidTests testAggregateIntermediate {TestCase=<geo_point> #2} elastic#112461
nik9000 pushed a commit that referenced this pull request Sep 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants