Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] Various MlWithSecurityIT tests failing on Windows with incorrect auth failures #31058

Closed
droberts195 opened this issue Jun 4, 2018 · 13 comments
Labels
blocker :Delivery/Packaging RPM and deb packaging, tar and zip archives, shell and batch scripts :ml Machine learning Team:Delivery Meta label for Delivery team >test-failure Triaged test failures from CI v6.4.0 v7.0.0-beta1

Comments

@droberts195
Copy link
Contributor

There have been several cases in the last few days of the x-pack/qa/smoke-test-ml-with-security suite failing on Windows due to auth failures for users who ought to have permissions to call the endpoints they're calling.

An example from master is https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-windows-compatibility/1675/console

action [cluster:monitor/xpack/ml/job/results/overall_buckets/get] is unauthorized for user [ml_admin]

An example from 6.x is https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+multijob-windows-compatibility/1099/console

action [cluster:monitor/main] is unauthorized for user [ml_user]

I suspect this is a side effect of a security change, as no changes have recently been made to the ML permissions. The Windows builds have been repeatedly failing for around 2 weeks due to various problems, so the security change could be up to 2 weeks old. But its effects on ML would have been masked by failures earlier in the build.

@droberts195 droberts195 added >test-failure Triaged test failures from CI v7.0.0 :ml Machine learning :Security/Authorization Roles, Privileges, DLS/FLS, RBAC/ABAC v6.4.0 labels Jun 4, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-security

@droberts195
Copy link
Contributor Author

droberts195 commented Jun 4, 2018

#30794 was merged on 24th May, and is present in 6.4 but not 6.3 which would be consistent with these spurious auth failures affecting CI for 6.4 but not 6.3.

(#30787 was also merged on 24th May, but into 6.3 too, so probably not the cause of this problem as 6.3 builds are succeeding.)

@mayya-sharipova
Copy link
Contributor

Recent failures of MlWithSecurityUserRoleIT tests on Windows on master are in: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-windows-compatibility/1678/console

An example of one of them:

REPRODUCE WITH: gradlew :x-pack:qa:smoke-test-ml-with-security:integTestRunner \
  -Dtests.seed=9B4D2B8ED6F5BF3E \
  -Dtests.class=org.elasticsearch.smoketest.MlWithSecurityUserRoleIT \
  -Dtests.method="test {yaml=ml/start_stop_datafeed/Test start given field without mappings}" \
  -Dtests.security.manager=true \
  -Dtests.locale=fi-FI \
  -Dtests.timezone=Australia/Adelaide \
  -Dtests.rest.blacklist=ml/ml_classic_analyze/Test analyze API with an analyzer that does what we used to do in native code,ml/calendar_crud/Test get calendar given missing,ml/calendar_crud/Test cannot create calendar with name _all,ml/calendar_crud/Test PageParams with ID is invalid,ml/calendar_crud/Test post calendar events given empty events,ml/calendar_crud/Test put calendar given id contains invalid chars,ml/calendar_crud/Test delete event from non existing calendar,ml/calendar_crud/Test delete job from non existing calendar,ml/custom_all_field/Test querying custom all field,ml/datafeeds_crud/Test delete datafeed with missing id,ml/datafeeds_crud/Test put datafeed referring to missing job_id,ml/datafeeds_crud/Test put datafeed with invalid query,ml/datafeeds_crud/Test put datafeed with security headers in the body,ml/datafeeds_crud/Test update datafeed with missing id,ml/delete_job_force/Test cannot force delete a non-existent job,ml/delete_model_snapshot/Test delete snapshot missing snapshotId,ml/delete_model_snapshot/Test delete snapshot missing job_id,ml/delete_model_snapshot/Test delete with in-use model,ml/filter_crud/Test create filter api with mismatching body ID,ml/filter_crud/Test get filter API with bad ID,ml/filter_crud/Test invalid param combinations,ml/filter_crud/Test non-existing filter,ml/get_datafeed_stats/Test get datafeed stats given missing datafeed_id,ml/get_datafeeds/Test get datafeed given missing datafeed_id,ml/jobs_crud/Test cannot create job with existing categorizer state document,ml/jobs_crud/Test cannot create job with existing quantiles document,ml/jobs_crud/Test cannot create job with existing result document,ml/jobs_crud/Test cannot create job with model snapshot id set,ml/jobs_crud/Test cannot decrease model_memory_limit below current usage,ml/jobs_crud/Test get job API with non existing job id,ml/jobs_crud/Test put job after closing results index,ml/jobs_crud/Test put job after closing state index,ml/jobs_crud/Test put job with inconsistent body/param ids,ml/jobs_crud/Test put job with time field in analysis_config,ml/jobs_crud/Test job with categorization_analyzer and categorization_filters,ml/jobs_get/Test get job given missing job_id,ml/jobs_get_result_buckets/Test mutually-exclusive params,ml/jobs_get_result_buckets/Test mutually-exclusive params via body,ml/jobs_get_result_categories/Test with invalid param combinations,ml/jobs_get_result_categories/Test with invalid param combinations via body,ml/jobs_get_result_overall_buckets/Test overall buckets given missing job,ml/jobs_get_result_overall_buckets/Test overall buckets given non-matching expression and not allow_no_jobs,ml/jobs_get_result_overall_buckets/Test overall buckets given top_n is 0,ml/jobs_get_result_overall_buckets/Test overall buckets given top_n is negative,ml/jobs_get_result_overall_buckets/Test overall buckets given invalid start param,ml/jobs_get_result_overall_buckets/Test overall buckets given invalid end param,ml/jobs_get_result_overall_buckets/Test overall buckets given bucket_span is smaller than max job bucket_span,ml/jobs_get_stats/Test get job stats given missing job,ml/jobs_get_stats/Test no exception on get job stats with missing index,ml/job_groups/Test put job with empty group,ml/job_groups/Test put job with group that matches an job id,ml/job_groups/Test put job with group that matches its id,ml/job_groups/Test put job with id that matches an existing group,ml/job_groups/Test put job with invalid group,ml/ml_info/Test ml info,ml/post_data/Test Flush data with invalid parameters,ml/post_data/Test flushing and posting a closed job,ml/post_data/Test open and close with non-existent job id,ml/post_data/Test POST data with invalid parameters,ml/preview_datafeed/Test preview missing datafeed,ml/revert_model_snapshot/Test revert model with invalid snapshotId,ml/start_stop_datafeed/Test start datafeed job, but not open,ml/start_stop_datafeed/Test start non existing datafeed,ml/start_stop_datafeed/Test stop non existing datafeed,ml/update_model_snapshot/Test without description,ml/validate/Test invalid job config,ml/validate/Test job config is invalid because model snapshot id set,ml/validate/Test job config that is invalid only because of the job ID,ml/validate_detector/Test invalid detector

@jaymode jaymode self-assigned this Jun 5, 2018
@jaymode
Copy link
Member

jaymode commented Jun 5, 2018

Given this is related to authorization, this is unlikely related to #30794 as we would see a different type of failure. I'll see what I can dig up, but one issue I have is why are we only seeing this on windows?

@droberts195
Copy link
Contributor Author

@jaymode you're correct that #30794 is not to blame for this - sorry I jumped to conclusions by looking for security PRs in the change list.

So far I've worked out that the breakage occured somewhere in between ad43231 and 946f519 on the 6.x branch, i.e. late on 22nd May or early on 23rd May and more than 20 commits before your security PRs went into 6.x. I'll try to narrow it down further.

@tvernum
Copy link
Contributor

tvernum commented Jun 6, 2018

@droberts195 I wonder if it's due to bb8f1e8

That's windows only, and this smoke test has a number of setupCommand in the build.gradle

@droberts195
Copy link
Contributor Author

You are correct @tvernum. The ML with security tests pass on commit 4d3e285 but all fail with auth failures on bb8f1e8 (which is the one immediately afterwards).

@droberts195
Copy link
Contributor Author

droberts195 commented Jun 6, 2018

I've made some more progress.

The problem occurs because when multiple comma separated roles are supplied to the elasticsearch-users.bat script, the commas now get replaced with spaces before the roles arrive in UsersTool.main().

In other words, before bb8f1e8 one of our setup commands meant that UsersTool.main() received an args[] containing this:

[useradd, ml_admin, -p, x-pack-test-password, -r, minimal,machine_learning_admin]

But after bb8f1e8 UsersTool.main() receives an args[] containing this:

[useradd, ml_admin, -p, x-pack-test-password, -r, minimal machine_learning_admin]

The docs say that a comma separated list of roles is supported (search for "This option accepts a comma-separated list of role names to assign to the user."), therefore this is a regression and a blocker for 6.4 because elasticsearch-users.bat no longer works as documented.

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

@droberts195 droberts195 added :Delivery/Packaging RPM and deb packaging, tar and zip archives, shell and batch scripts and removed :Security/Authorization Roles, Privileges, DLS/FLS, RBAC/ABAC labels Jun 6, 2018
@droberts195
Copy link
Contributor Author

droberts195 commented Jun 6, 2018

https://stackoverflow.com/questions/8145744/how-to-invoke-a-parameter-in-a-batch-file-with-comma-delimiter is relevant here regarding treatment of commas in batch files. Unfortunately neither of the solutions work in our case as we're trying to separate the first argument out of the batch file arguments while leaving all the others quoted exactly as they were previously, and that seems to be impossible in general. However it is done there will always be some combination of quotes, spaces, commas or semi-colons where the chosen algorithm fails.

The only solutions I can think of are:

  1. Change the implementation of Reduce CLI scripts to one-liners on Windows #30772 such that the name of the main class is passed to elasticsearch-cli.bat in an environment variable rather than an argument. Then the raw argument list can be passed on to Java without any manipulation in the batch file.
  2. Revert Reduce CLI scripts to one-liners on Windows #30772.

Do you have any better ideas @jasontedor?

@jasontedor
Copy link
Member

Thanks a lot @droberts195. I favored option one as that was an approach I considered when I initially made the change that led to the breakage here. I am sorry for the build breakage and that you and @tvernum lost time to this.

@jasontedor
Copy link
Member

Closed by #31156

@mark-vieira mark-vieira added the Team:Delivery Meta label for Delivery team label Nov 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocker :Delivery/Packaging RPM and deb packaging, tar and zip archives, shell and batch scripts :ml Machine learning Team:Delivery Meta label for Delivery team >test-failure Triaged test failures from CI v6.4.0 v7.0.0-beta1
Projects
None yet
Development

No branches or pull requests

8 participants