
Update Ingest Manager to handle Packages with Elasticsearch Transform #75153

Closed
nnamdifrankie opened this issue Aug 17, 2020 · 28 comments
Labels
Team:Defend Workflows, Team:Fleet, v7.10.0

Comments

@nnamdifrankie
Contributor

nnamdifrankie commented Aug 17, 2020

Describe the feature:

Given a package that contains an Elasticsearch transform specification, we should be able to install, update, delete, and start the transform based on the current state of the package and its prior history.

Describe a specific use case for the feature:

  • When a package contains a new transform, we should install and start it.
  • When a package contains an updated transform, we should update and restart it.
  • When a package no longer contains a transform that was defined in previous versions, we should delete it.
  • When a package contains a transform with a changed name or identifier, we should resolve the rename and ensure that only one copy of the transform is running.
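For illustration, a rough sketch of the Elasticsearch calls behind these use cases, using the @elastic/elasticsearch TypeScript client (the id, spec, and node URL are illustrative, not the actual package contents):

import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Hypothetical id and spec; the real installer would derive these from the package.
const transformId = 'metrics-endpoint.metadata-default-0.16.0';
const transformSpec = {
  source: { index: 'metrics-endpoint.metadata-*' },
  dest: { index: 'metrics-endpoint.metadata_current-default' },
  pivot: {
    group_by: { 'agent.id': { terms: { field: 'agent.id' } } },
    aggregations: { last_seen: { max: { field: '@timestamp' } } },
  },
};

async function run() {
  // Install: create the transform from the package's JSON spec.
  await client.transform.putTransform({ transform_id: transformId, body: transformSpec });

  // Start the transform once it is installed.
  await client.transform.startTransform({ transform_id: transformId });

  // As discussed in this thread, update is modeled as stop + delete of the
  // old transform followed by put + start of the new one.
  await client.transform.stopTransform({ transform_id: transformId, wait_for_completion: true });
  await client.transform.deleteTransform({ transform_id: transformId });
}

run().catch(console.error);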
@nnamdifrankie nnamdifrankie added the Team:Fleet label Aug 17, 2020
@elasticmachine
Contributor

Pinging @elastic/ingest-management (Team:Ingest Management)

@elasticmachine
Contributor

Pinging @elastic/endpoint-management (Team:Endpoint Management)

@ph
Contributor

ph commented Aug 17, 2020

We have an early checklist for adding custom/new types to packages: elastic/package-spec#27

@nnamdifrankie
Contributor Author

@ph It looks like a request to document what is needed for different assets in the package specs. I think the next step is to create some implementation tickets in the different repos.

@nnamdifrankie
Contributor Author

nnamdifrankie commented Aug 17, 2020

@ruflin Can we get some implementation tickets created so we can start knocking some of them off? I will add some constraints to the ticket about how the source index has to exist for the transform to be applied successfully.

Constraints and Notes:

  • The source index has to exist before the transform is created and started. For data stream indices, the backing index is only created when a document is added to the stream.
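A minimal sketch of guarding the start call on that constraint (a hypothetical helper, assuming the @elastic/elasticsearch client):

import { Client } from '@elastic/elasticsearch';

// Returns true if the transform was started; false if we deferred because
// the source index does not exist yet.
async function maybeStartTransform(client: Client, transformId: string, sourceIndex: string): Promise<boolean> {
  const { body: sourceExists } = await client.indices.exists({ index: sourceIndex });
  if (!sourceExists) {
    // For data streams the backing index only appears with the first
    // document; retry the start later, or seed a document first.
    return false;
  }
  await client.transform.startTransform({ transform_id: transformId });
  return true;
}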

@ruflin
Member

ruflin commented Aug 18, 2020

There are two main issues I see at the moment:

  • What do we do if no data is there yet (i.e., no index)?
  • What do we do with all the namespaces, given that a transform only applies to one?

To get things moving here I suggest the following:

  • Create a PR with a package that contains a transform (probably best an updated endpoint package). I expect this PR to happen somewhere a registry can be spun up with this package for testing (@jonathan-buttner what is the workflow today for this for endpoint packages?)
  • Create a PR against Kibana that implements the CRUD logic for transforms. It is probably best to copy from what we do for ingest pipelines or templates. @skh @neptunian should be able to point you to the right place here.

The above sidesteps a few issues:

  • No package spec: We only need the package spec definition once both PRs are merged
  • Namespace: Let's focus on using only the default namespace for now

@nnamdifrankie Could you get the above two PRs started and link them here?

@nnamdifrankie
Contributor Author

@ruflin

What do we do if no data is there yet (i.e., no index)?

Currently our best case is where data already exists, e.g. a 7.9 to 7.10 upgrade. We have to consider the best option for getting a document into the source index. We currently ignore documents with certain attributes, so we could create a similar document to use as a seed. But we have to decide who will send this document.

Create a PR with a package that contains a transform (probably best an updated endpoint package). I expect this PR to happen somewhere a registry can be spun up with this package for testing

We always test the registry locally during development using Docker, but we can also use the code at https://github.com/elastic/endpoint-package/blob/master/Makefile#L161

@nnamdifrankie
Contributor Author

@elastic/ingest-management @elastic/endpoint-management

Following my exploration of the EPM code, I have been able to install a transform but not start it. Starting it requires the source index to exist; we will explore options with the Elasticsearch team. In the meantime, I want to get your input on the installation strategy for transforms.

Transforms in a package's datasets present several cases that influence how we perform the installation:

  • A transform can be moved to another dataset (different installation name) after having been installed in a different dataset in prior versions. If we do not properly detect this drift and clean up the old version, we could end up with multiple copies.

  • It is wasteful and incorrect to have multiple versions of the transform doing the same processing when the intention was to move the transform, as in the example above.

  • Other attributes of the transform can change. We can always detect this using some form of hashing (see the sketch after the candidate solutions below).

Candidate Solutions:

  1. Install and delete:

Assuming the current state is the desired state, we can delete the old transform carrying the dataset prefix and version, along with its reference in the saved objects (SO), after we have successfully installed the new transform, if any exists (the move case). Before installation, capture all transforms and object references with the dataset prefix; after successful installation, delete using the captured information. The consideration is that we continue to support any rollback guarantees in the case of failure, and deleting after installing may help satisfy this requirement. Deletion may also fail, which would leave multiple versions and stale transforms behind.

  2. Calculate, install, and delete:

Calculate the diffs, drift, and change cases, then apply install and delete actions accordingly. The consideration here is catching all the cases while ensuring that we maintain rollback guarantees in the case of failure. We also have to account for delete failures, which can leave an inconsistent and undesired state.
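For the "calculate" part of option 2, a sketch of the hashing idea mentioned above (the canonicalization and the stored-hash lookup are assumptions, not existing EPM code):

import { createHash } from 'crypto';

// Sort object keys recursively so semantically identical specs
// serialize, and therefore hash, identically.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>;
    return Object.keys(obj)
      .sort()
      .reduce<Record<string, unknown>>((acc, key) => {
        acc[key] = canonicalize(obj[key]);
        return acc;
      }, {});
  }
  return value;
}

// Stable fingerprint of a transform spec from the package.
function transformHash(spec: object): string {
  return createHash('sha256').update(JSON.stringify(canonicalize(spec))).digest('hex');
}

// Usage (newSpec comes from the package, storedHash from a hypothetical
// field on the saved-object reference):
// const changed = transformHash(newSpec) !== storedHash;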

Your comments are welcome.

@ph
Contributor

ph commented Aug 26, 2020

Thanks @nnamdifrankie for the spike, @skh or @neptunian can you look at this?

@ruflin
Member

ruflin commented Aug 27, 2020

@nnamdifrankie I think I don't fully understand your example of a transform moving to another dataset, which unfortunately is important for all the follow-ups. Could you share an example?

To get started, I would keep it as simple as possible. Would the following flow be possible?

  • Stop transform
  • Delete transform
  • Load new transform
  • Start new transform

@nnamdifrankie Could this cause any side effects? You mention above that parts of the transform can change. My assumption is that these changes always mean a new version of the package?

If we follow the above, rollback should also be pretty straightforward, as it is just the same in reverse.

Another open question for me is where transforms fit in the chain of asset installation on install / upgrade. I assume after the templates and ingest pipelines have been loaded but before UI elements?

Another thing I learned yesterday when talking to the ML team is that the source index can be a pattern, so logs-foo-* can be the source. It would be interesting to know whether the target index could contain a variable, something like logs-foo-{data_stream.namespace}. It would mean a single transform could act for multiple namespaces, but I doubt this is possible at the moment. If there were multiple source indices, what would be our expectation for the target index: all data in a single target index, or one per namespace?
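To make the question concrete, a hypothetical spec for PUT _transform with a wildcard source; as far as I know the dest has to be one concrete index today, with no {data_stream.namespace}-style substitution:

// Wildcard sources are accepted, so every matching namespace is pivoted
// into the one concrete dest index below; a per-namespace target would
// need one transform per namespace.
const wildcardTransform = {
  source: { index: 'logs-foo-*' },
  dest: { index: 'logs-foo-pivoted-default' }, // single concrete target (illustrative name)
  pivot: {
    group_by: { 'host.name': { terms: { field: 'host.name' } } },
    aggregations: { last_seen: { max: { field: '@timestamp' } } },
  },
};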

@nnamdifrankie
Contributor Author

@ruflin Sorry I was not clear earlier.

I think I don't fully understand your example of a transform moving to another dataset, which unfortunately is important for all the follow-ups. Could you share an example?

Say we have the transform in a previous version.

Endpoint Package 0.15.0:

We have a transform in the metadata dataset: package/endpoint/dataset/metadata/elasticsearch/transform/default.json. This will create a transform metrics-endpoint.metadata-default-0.15.0-[timestamp]. We could also use metrics-endpoint.metadata-default, but that would not allow us to roll back if the current version of the transform fails.

Endpoint Package 0.16.0:

We move the transform to the metadata_current dataset: package/endpoint/dataset/metadata_current/elasticsearch/transform/default.json. This will create a transform metrics-endpoint.metadata_current-default-0.16.0-[timestamp]. This could be a new transform or one similar to the transform in the metadata dataset, but the ultimate outcome is that the metadata dataset transform should be deleted and this one installed.

@nnamdifrankie
Contributor Author

@ruflin

Another thing I learned yesterday when talking to the ML team is that the source index can be a pattern, so logs-foo-* can be the source. It would be interesting to know whether the target index could contain a variable, something like logs-foo-{data_stream.namespace}. It would mean a single transform could act for multiple namespaces, but I doubt this is possible at the moment. If there were multiple source indices, what would be our expectation for the target index: all data in a single target index, or one per namespace?

It is possible that the wildcard picks up disjoint documents that match the query and pivot of the transform. Those documents will be transferred to the destination index, but the mapping of the destination index determines how useful they are. If a document maps correctly it will be retrieved in queries; otherwise it will not.

@nnamdifrankie
Contributor Author

@ruflin

To get started, I would keep it as simple as possible. Would the following flow be possible?
Stop transform
Delete transform
Load new transform
Start new transform
@nnamdifrankie Could this cause any side effects? You mention above that parts of the transform can change. My assumption is that these changes always mean a new version of the package?
If we follow the above, rollback should also be pretty straightforward, as it is just the same in reverse.

My plan was to:

  1. Capture the current transform state information for the current dataset, either from the saved object reference or from Elasticsearch; the former is preferred since Elasticsearch can be modified out of band. This id will be used for deletion once the new installation succeeds.

  2. Install the transform. Since we are timestamping our ids, the new transform installs alongside the old one. This could also be a forced reinstall of the package, or a no-op if the transform has been removed from the dataset.

  3. Start the transform. TBD given the option we choose.

  4. On success, delete the old state using the information captured in step 1. Deleting the old information can also fail in the worst case.

  5. On failure, delete the current state, i.e. the resources we created for this version; technically this is a rollback. Deletion can also fail in the worst case.

The goal is to have one transform per purpose because any slight difference in the code could mean different documents in the target.
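For illustration, those five steps end to end as a hypothetical helper (not the actual EPM implementation; the key property is that old transforms are deleted only after the new ones are running, and a failure removes only what this run created):

import { Client } from '@elastic/elasticsearch';

async function installDatasetTransforms(
  client: Client,
  datasetPrefix: string,             // e.g. 'metrics-endpoint.metadata-' (illustrative)
  newSpecs: Record<string, object>   // new transform id -> spec from the package
): Promise<void> {
  // 1. Capture the currently installed transforms for this dataset.
  const { body: current } = await client.transform.getTransform({
    transform_id: `${datasetPrefix}*`,
    allow_no_match: true,
  });
  const oldIds: string[] = current.transforms.map((t: { id: string }) => t.id);

  const created: string[] = [];
  try {
    // 2 & 3. Install and start the new transforms; timestamped ids mean
    // they never collide with the old ones.
    for (const [id, spec] of Object.entries(newSpecs)) {
      await client.transform.putTransform({ transform_id: id, body: spec });
      created.push(id);
      await client.transform.startTransform({ transform_id: id });
    }
    // 4. Success: delete the superseded transforms captured in step 1.
    for (const id of oldIds) {
      await client.transform.stopTransform({ transform_id: id, wait_for_completion: true });
      await client.transform.deleteTransform({ transform_id: id });
    }
  } catch (err) {
    // 5. Failure: roll back only what this run created; the old
    // transforms were never touched and keep running.
    for (const id of created) {
      await client.transform.deleteTransform({ transform_id: id, force: true });
    }
    throw err;
  }
}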

@neptunian
Contributor

We have a transform in the metadata dataset: package/endpoint/dataset/metadata/elasticsearch/transform/default.json. This will create a transform metrics-endpoint.metadata-default-0.15.0-[timestamp]. We could also use metrics-endpoint.metadata-default, but that would not allow us to roll back if the current version of the transform fails.

@nnamdifrankie Isn't the version in the name enough without having the timestamp to allow a rollback in case of a failure? Can you explain the need for the timestamp? Thanks.

@nnamdifrankie
Contributor Author

@neptunian

Isn't the version in the name enough without having the timestamp to allow a rollback in case of a failure? Can you explain the need for the timestamp? Thanks.

It is for the forced install case, where the versions are the same.

@ruflin
Member

ruflin commented Aug 31, 2020

I think I'm also still missing the part around the force install and the timestamp. When exactly is this happening?

You mention above

Capture current transform state information

What does this exactly mean? Could you share an example?

You also mention:

The goal is to have one transform per purpose

Are you referring here to multiple transforms per dataset?

My current assumptions are and please let me know which ones are wrong:

  • We can stop, overwrite, and start a transform. As long as we do this within a short time, no data in the target index will be missing.
  • We can do exactly the same to roll back. The old version of the transform is always available in the old package.
  • If everything falls apart during an upgrade / downgrade, we can wipe the target index and start the transform from scratch again, and it will be able to fully rebuild the target index. This assumes we don't throw away data in the source index.

@nnamdifrankie
Contributor Author

nnamdifrankie commented Aug 31, 2020

@ruflin Sorry it was not clear. First let me answer in the context of your steps here:

Stop transform
Delete transform
Load new transform
Start new transform

My proposal is similar to your steps, just with a change in order:

  1. Take a list of the current transforms. There can be many transforms in the dataset.
  2. Stop the transforms.
  3. Install the new transforms. A transform may have been removed from the dataset, in which case nothing installs.
  4. If the install succeeds, delete the transforms from step 1.
  5. If we fail, delete whatever resources were installed before the failure occurred (a partial install); we want all or nothing, I believe. Then restart the current transforms from the list in step 1. This is the rollback, but if you say that a failure will prompt EPM to reinstall the previous version, then maybe this is void.

I think I'm also still missing the part around the force install and the timestamp. When exactly is this happening?
You mention above
Capture current transform state information
What does this exactly mean? Could you share an example?

With my steps above, we do not plan to install over any transforms. Transforms are only removed after we have successfully installed the new ones. Hence the need for timestamps, or some other unique identifier, even in a forced update of the same version.

You also mention:
The goal is to have one transform per purpose
Are you referring here to multiple transform per dataset?

Yes. If we have dataset1.transform1 and dataset2.transform2 that have the same code, update the same index, and run at different times, I believe that is not a desirable state. What do you think?

We can do exactly the same to rollback. The old version of the transform is always available in the old package.

Do you currently have a rollback handler in case of failure that tries to install the previous version?

We can stop, overwrite, and start a transform. As long as we do this within a short time, no data in the target index will be missing.

I can only answer this question by testing. Timing is everything with this setup.

If everything falls apart during an upgrade / downgrade, we can wipe the target index and start the transform from scratch again, and it will be able to fully rebuild the target index. This assumes we don't throw away data in the source index.

Let's talk about this for clarity: wiping the destination index is technically a service outage.

@ruflin
Member

ruflin commented Aug 31, 2020

Looks like we are mostly on the same page. The only difference is whether the transform is overwritten (what we do for index templates) or versioned (what we do for ingest pipelines). I guess both will work. If we use a version for the transform, let's use the package version.

Could you test what the maximum time is that we have to get the new transform in place?

For the rollback, @neptunian can share here more.

What is our naming convention for the case where we have multiple transforms in a single dataset? Will we append the file name without the .json part? What is the final name in Elasticsearch?

@nnamdifrankie
Contributor Author

What is the final name in Elasticsearch?

{
"id": "metrics-endpoint.metadata-current-default-0.16.0-dev.0-20207319",
"type": "transform"
}

Where 0.16.0-dev.0 is the version and 20207319 is a unique timestamp.
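For illustration, a hypothetical helper producing that shape of id (not existing EPM code; it uses an epoch timestamp rather than the exact 20207319 format above):

// Produces e.g. 'metrics-endpoint.metadata-current-default-0.16.0-dev.0-1598900000000',
// i.e. <type>-<dataset>-<file>-<package version>-<unique suffix>. The suffix
// keeps a forced reinstall of the same version from colliding with the
// transform that is already installed.
function getTransformId(type: string, dataset: string, file: string, version: string): string {
  return `${type}-${dataset}-${file}-${version}-${Date.now()}`;
}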

@ruflin
Member

ruflin commented Sep 1, 2020

Two questions:

  • The default part in the name above: is this the name of the file in the package?
  • For the timestamp + force install: how does this work today for other assets like ingest pipelines? Do we just overwrite them? Or do we delete and install again? @neptunian might be able to share more details here.

@nnamdifrankie It seems you need the force install for development purposes, or is this something you expect to see in production? I'm worried we are building something for dev that we should solve in a different way. For example, we could have a special method for "overwrite / force install" that does the right thing for each asset; in the case of a transform, it would delete it and install it again.

@neptunian
Contributor

If by forced install you mean a reinstall where the versions are the same, we don't delete the previous ingest pipeline. Since we PUT the ingest pipeline, it updates the existing versioned one if it exists.
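For contrast with transforms, a minimal sketch of why that works (pipeline id and body are illustrative; PUT is an upsert, so re-sending the same versioned id overwrites in place):

import { Client } from '@elastic/elasticsearch';

// Re-running this with an unchanged id never creates a second copy;
// Elasticsearch simply replaces the pipeline stored under that id.
async function putVersionedPipeline(client: Client): Promise<void> {
  await client.ingest.putPipeline({
    id: 'metrics-endpoint.metadata-0.16.0', // hypothetical versioned id
    body: {
      description: 'example pipeline from a package',
      processors: [{ set: { field: 'event.ingested', value: '{{_ingest.timestamp}}' } }],
    },
  });
}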

@ruflin We don't currently have a rollback for unknown errors that aren't handled. I think it was mentioned here that we'd improve on it, but I realize some default behavior should happen. Currently it will just error out, and when you refresh Kibana it will do a "package health check" and try to reinstall whatever package you were trying to install. This behaviour is mainly there to handle unknown errors that cause Kibana to crash, though. We should add a case for catching any unknown error and trying to reinstall the previous version. This should be a minor change where we try to install the previous version in an else clause here. There is no special rollback handler.

@nnamdifrankie
Contributor Author

@neptunian @ruflin let's make sure there is a ticket for this rollback so it does not fall through the cracks. Right now I have only handled the happy path, with the belief that rollback will be handled by the main handler.

@neptunian
Contributor

#76663

@kevinlog kevinlog added planning and removed planning labels Sep 3, 2020
@nnamdifrankie
Contributor Author

#74394

@nnamdifrankie
Contributor Author

@ruflin @kevinlog Are we fine to close this ticket? We can create more focused stories if needed.

@ruflin
Member

ruflin commented Sep 14, 2020

@nnamdifrankie I'm good with closing it, but let's make sure we follow up on the remaining issues, like the index problem.

@ph
Contributor

ph commented Oct 19, 2020

@nnamdifrankie Can you create an issue for the index problem?

The source index has to exist before the transform is created and started. For data stream indices, the backing index is only created when a document is added to the stream.

And close this one?

@ph ph closed this as completed Oct 19, 2020
@ph ph reopened this Oct 19, 2020
@ph ph added the v7.10.0 label Oct 19, 2020
@nnamdifrankie
Contributor Author

@ph I will probably have to create it in the ML issues board and link it to this one.

@MindyRS MindyRS added Team:Defend Workflows and removed Team:Endpoint Management labels Oct 29, 2020