Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failing test: Chrome X-Pack Observability UI Functional Tests.x-pack/test/observability_functional/apps/observability/pages/overview/alert_table·ts - ObservabilityApp Observability overview Without alerts navigate and open alerts section #140507

Closed
kibanamachine opened this issue Sep 12, 2022 · 27 comments · Fixed by #145828
Assignees
Labels
failed-test A test failure on a tracked branch, potentially flaky-test Team: Actionable Observability - DEPRECATED For Observability Alerting and SLOs use "Team:obs-ux-management", for AIops "Team:obs-knowledge"

Comments

@kibanamachine
Copy link
Contributor

kibanamachine commented Sep 12, 2022

A test failed on a tracked branch

Error: retry.try timeout: TimeoutError: The element [data-test-subj="internalAlertsPageLoading"] was still present when it should have disappeared.
Wait timed out after 10165ms
    at /var/lib/buildkite-agent/builds/kb-n2-4-spot-cf11dbdfc244df0c/elastic/kibana-on-merge/kibana/node_modules/selenium-webdriver/lib/webdriver.js:906:17
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at onFailure (test/common/services/retry/retry_for_success.ts:17:9)
    at retryForSuccess (test/common/services/retry/retry_for_success.ts:59:13)
    at RetryService.try (test/common/services/retry/retry.ts:31:12)
    at waitForAlertsTableLoadingToDisappear (x-pack/test/functional/services/observability/overview/common.ts:57:5)
    at Object.openAlertsSectionAndWaitToAppear (x-pack/test/functional/services/observability/overview/common.ts:70:5)
    at Context.<anonymous> (x-pack/test/observability_functional/apps/observability/pages/overview/alert_table.ts:35:9)
    at Object.apply (node_modules/@kbn/test/target_node/src/functional_test_runner/lib/mocha/wrap_function.js:87:16)

First failure: CI Build - main

@kibanamachine kibanamachine added the failed-test A test failure on a tracked branch, potentially flaky-test label Sep 12, 2022
@botelastic botelastic bot added the needs-team Issues missing a team label label Sep 12, 2022
@kibanamachine
Copy link
Contributor Author

New failure: CI Build - main

@spalger spalger added the Team: Actionable Observability - DEPRECATED For Observability Alerting and SLOs use "Team:obs-ux-management", for AIops "Team:obs-knowledge" label Sep 12, 2022
@elasticmachine
Copy link
Contributor

Pinging @elastic/actionable-observability (Team: Actionable Observability)

@botelastic botelastic bot removed the needs-team Issues missing a team label label Sep 12, 2022
@spalger
Copy link
Contributor

spalger commented Sep 12, 2022

@maryam-saeidi looks like this test is failing already and was added in #140024, going to skip, would you mind checking out the flakiness of this?

@spalger
Copy link
Contributor

spalger commented Sep 12, 2022

/skip

@kibanamachine
Copy link
Contributor Author

Skipped

main: 30af54d

@kibanamachine
Copy link
Contributor Author

New failure: CI Build - main

@maryam-saeidi
Copy link
Member

@spalger Sure, I'll check it.

@maryam-saeidi
Copy link
Member

maryam-saeidi commented Sep 19, 2022

Result of first round of investigation:

I’ve added a log in the else section to see what is the response in case a test fails and we end up there:
image

Here is the result in the log for one of the failed functional test: (after selecting log for a failed test, search for =======> else response: in the mentioned link)

{
  "id": "FmNZS3dJQ0FJUjVtVXlZTW1UVnhxU0EbVzBaR2xRcnVRTk9SQjN6VEkzSFk1Zzo3NDI2",
  "rawResponse": {
    "took": 6,
    "timed_out": false,
    "terminated_early": false,
    "num_reduce_phases": 0,
    "_shards": {
      "total": 3,
      "successful": 0,
      "skipped": 0,
      "failed": 0
    },
    "hits": {
      "total": 0,
      "max_score": null,
      "hits": []
    }
  },
  "isPartial": true,
  "isRunning": true,
  "total": 3,
  "loaded": 0,
  "isRestored": false
}

So it is not a complete response based on this condition:

isCompleteResponse: Boolean(response && response.isRunning && response.isPartial)

And it is also not a failure based on this condition:

isErrorResponse: !response || !response.rawResponse || (!response.isRunning && !!response.isPartial)

Hence ending up in else branch.

Now I am checking the meaning of isPartial and isRunning. Also, I ran the test again without disabling loader in case we are in else to see if it will try to fetch the alerts again.
Also, I want to take a look at other alerts table tests to see if they are enabled and working properly.

@maryam-saeidi
Copy link
Member

maryam-saeidi commented Sep 19, 2022

When I removed disabling loading from else, the test failed less (only 1 in 10) and in the failed test see this log:

{
  "rawResponse": {
    "took": 5,
    "timed_out": false,
    "_shards": {
      "total": 3,
      "successful": 3,
      "skipped": 2,
      "failed": 0
    },
    "hits": {
      "total": 0,
      "max_score": null,
      "hits": []
    }
  },
  "isPartial": false,
  "isRunning": false,
  "total": 3,
  "loaded": 3,
  "isRestored": false
}

So it seems like a complete response, the question now is why in the screenshot, we still have a loading without alerts table:

image

It's good to know that the API debug log happens at [00:02:29] and the test fails at [00:03:00]:

image

@maryam-saeidi
Copy link
Member

maryam-saeidi commented Sep 20, 2022

When I was checking other alerts table functional tests, I noticed that they are fine and this issue only happens for the overview page, which the difference is having the table in an accordion 😕

After discussing it with @XavierM, he had 2 proposals to check:

  1. What if instead of waiting for the loading state to stop, try to check the existence of the component that we want to check later?
  2. What if we don't need to open the accordion during the functional test?

For 1, I don't think that would be an issue, because, in the snapshot, we still see the loading state, so other parts of the table will not be visible. Also, I remember, I had an issue with the test failing so I added the step to check if loading is still there, so this step was added because of the failure.

For 2, I changed the code to always have the accordion open and see what will happen in that case, and now tests are passing 🤯 (https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/1244)
So I assume this difference is causing an issue @XavierM @spalger any idea how to fix this flakiness?

@spalger
Copy link
Contributor

spalger commented Sep 21, 2022

I can't imagine option 1 helping since the test is failing after 30 seconds and then accordion is still loading. Last time we talked we assumed that there was some flaw in the UI that lead to the accordion never exiting the loading state in some cases, have you proved that isn't the case @maryam-saeidi? Maybe we can just try extending the timeout to 2 minutes? If the test still fails I really don't think it's the test and the two solutions you've proposed are implying that the problem is with the tests, not the UI.

@maryam-saeidi
Copy link
Member

@spalger yes, I've added a log based on our discussion and based on the screenshot here, it seems we fetch the data correctly in the failed test but the loading is still there even after 30 secs.

The strange part is, the tests are passing when we load the table at the page load time but if we put the table in accordion and the accordion is closed at the page load time, then we have the flaky test after opening the accordion. 😖

@spalger
Copy link
Contributor

spalger commented Sep 21, 2022

the tests are passing when we load the table at the page load time but if we put the table in accordion and the accordion is closed at the page load time

Still sounds like a UI issue to me, one that users are likely to encounter in conditions similar to CI

@maryam-saeidi
Copy link
Member

When I logged the isLoading variable, it seems the request is successful and the alert response is back but for some reason, the component sets the isLoading to true again: (here)
image

@spalger
Copy link
Contributor

spalger commented Sep 23, 2022

Nice find @maryam-saeidi! Have you inspected all the places where the isLoading state is set in the component?

@maryam-saeidi
Copy link
Member

There is only one place, I've added more logs to see whether I can identify what change causes extra loading for the component.
The most confusing part is the fact that it only happens if the alert component is in an accordion and if that accordion is initially closed on the page. It does not make sense to me at all, have you heard of an issue caused by a component being hidden?! Can rendering be different when the component is hidden?
I also tried slow network locally for many times but it didn't happen there. (It happens 1 out of 10 times in functional tests)

@spalger
Copy link
Contributor

spalger commented Sep 23, 2022

Sounds to me like maybe a useEffect in something that's rendered inside that accordion might have a race condition. This sounds like a classic race condition without knowing more about exactly what has access to what. If you want to send me some links to components I can look around and see if I can spot anything suspicious.

@maryam-saeidi
Copy link
Member

maryam-saeidi commented Sep 23, 2022

Ah, that would be great, here you can find the 2 files that I've added logs for debugging:
https://github.com/elastic/kibana/pull/140638/files

  • Component: x-pack/plugins/triggers_actions_ui/public/application/sections/alerts_table/alerts_table_state.tsx
  • Hook for fetching data: x-pack/plugins/triggers_actions_ui/public/application/sections/alerts_table/hooks/use_fetch_alerts.tsx

@maryam-saeidi
Copy link
Member

maryam-saeidi commented Sep 23, 2022

It seems this difference is causing the re-rendering: (link)

image

@spalger if it is a race condition, shouldn't it also happen when the component is outside of the accordion?

@spalger
Copy link
Contributor

spalger commented Sep 23, 2022

@maryam-saeidi Ok, I think I found it: https://github.com/elastic/kibana/blob/main/x-pack/plugins/triggers_actions_ui/public/application/sections/alerts_table/hooks/use_fetch_alerts.tsx#L175-L176

If asyncSearch() or refetch.current() is called when data && data.search doesn't resolve to true then loading will be set to true even though a request is not being sent.

I think if we move the dispatch({ type: 'loading', loading: true }); line into the if() body then it should only set the loading state if it's actually going to make a request.

@maryam-saeidi
Copy link
Member

@spalger Wow, interesting! Let me give it a try.

@maryam-saeidi
Copy link
Member

@spalger unfortunately, it still has the same issue. (flaky test runner)

The most confusing part is the fact that the issue only happens when the alerts table is in an accordion which is closed at the start. So I am wondering what is different in the initially hidden state that causes this issue.

@spalger
Copy link
Contributor

spalger commented Sep 27, 2022

I am wondering what is different in the initially hidden state that causes this issue.

I'm pretty sure that when the accordion is hidden the content isn't rendered, so the primary change is that the content is being rendered some time after the page was rendered, but something it's happening a little later and sometimes it's happening a little sooner based on the speed of selenium opening the accordion.

Either way, I think focusing on how the loading state is shown, and tracking back to how that is occurring, is the most fruitful path to follow.

I realized I had stopped looking for how the loading state is represented when I saw the isLoading variable, but that's not the only think that causes the loading indicator to be rendered. This is where the loading indicator is rendered:

{(isLoading || isBrowserFieldDataLoading) && (
<EuiProgress size="xs" color="accent" data-test-subj="internalAlertsPageLoading" />
)}

The other variable used there, isBrowserFieldDataLoading, is coming from this hook:

In that hook it looks like we have another case where an error will cause the component to always be in a "loading" state.

Do we have any insight into how that hook is working?

@maryam-saeidi
Copy link
Member

isBrowserFieldDataLoading was added recently and the issue was happening even before introducing isBrowserFieldDataLoading.

@spalger
Copy link
Contributor

spalger commented Sep 27, 2022

Okay, but that doesn't change the fact that the component is rendering a loading component indefinitely so the answer is in the logic somewhere...

@maryam-saeidi
Copy link
Member

Yes, I see your point.
I also want to check whether the request was actually made and if it was aborted.
Based on the logs, aborted was false when the issue happened.

I will check it later this week and keep you posted about the result.

@maryam-saeidi
Copy link
Member

It seems this fix also solved the flaky test issue.

Flaky test runner:

I am still curious why the second request is not getting any response but I think the fix is avoiding this second request since the time hasn't been changed, so the second request should not trigger the re-fetching as it was stated here.

cc @spalger, @XavierM, @jcger

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
failed-test A test failure on a tracked branch, potentially flaky-test Team: Actionable Observability - DEPRECATED For Observability Alerting and SLOs use "Team:obs-ux-management", for AIops "Team:obs-knowledge"
Projects
None yet
4 participants