
[Self-Managed]: Fleet Server permanently goes offline and memory consumption increases on changing logging level to debug. #3485

Closed
amolnater-qasource opened this issue Apr 23, 2024 · 12 comments
Labels: bug, impact:high, QA:Validated, Team:Elastic-Agent-Control-Plane, Team:Fleet

Comments


amolnater-qasource commented Apr 23, 2024

Kibana Build details:

VERSION: 8.14.0 BC1
BUILD: 73520
COMMIT: c1513cd7e5a00eab209ba02d30cafd6945d75470

Artifact Link: https://staging.elastic.co/8.14.0-a40d088a/summary-8.14.0.html

Host OS: All

Preconditions:

  1. 8.14.0-BC1 Kibana self-managed environment should be available.
  2. Fleet Server should be installed.

Steps to reproduce:

  1. Navigate to Fleet > Agents > Agent logs tab.
  2. Update the logging level to debug.
  3. Observe that fleet-server goes offline permanently and its memory consumption increases.

Expected Result:
Fleet Server should remain Healthy when the logging level is changed to debug.
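
For anyone scripting the reproduction, the same change the Agent logs tab makes can presumably be issued through Kibana's Fleet agent-actions API. Below is a minimal sketch in Go; the SETTINGS action payload shape, Kibana host, and credentials are assumptions for illustration, not details taken from this issue.

package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// Agent ID taken from the diagnostics in this issue; replace with your own.
	agentID := "f9489d84-c941-40ef-84eb-e07adcf4b37c"
	url := fmt.Sprintf("http://localhost:5601/api/fleet/agents/%s/actions", agentID)

	// Assumed payload: a SETTINGS action carrying the new log level,
	// mirroring what the Agent logs tab sends.
	body := []byte(`{"action":{"type":"SETTINGS","data":{"log_level":"debug"}}}`)

	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("kbn-xsrf", "true")       // required by Kibana on state-changing requests
	req.SetBasicAuth("elastic", "changeme")  // placeholder credentials

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}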

Logs:
elastic-agent-diagnostics-2024-04-23T04-48-12Z-00.zip

Screenshot:
image

Note:

  • The issue is consistently reproducible on our end.
amolnater-qasource added the bug, Team:Fleet, and impact:high labels on Apr 23, 2024
amolnater-qasource commented:

@manishgupta-qasource Please review.

manishgupta-qasource commented:

Secondary review for this ticket is done.


cmacknz commented Apr 23, 2024

components:
    - id: fleet-server-default
      state:
        component:
            apmconfig: null
            limits:
                gomaxprocs: 0
                source:
                    fields:
                        go_max_procs:
                            kind:
                                numbervalue: 0
        component_idx: 2
        features_idx: 2
        message: 'Healthy: communicating with pid ''6060'''
        state: 2
        units:
            input-fleet-server-default-fleet-server-fleet_server-a4eeee2f-bf68-436c-8c3f-f860be6f8299:
                message: 'Error - could not start the HTTP server for the API: failed to listen on the named pipe \\.\pipe\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock: open \\.\pipe\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock: Access is denied.'
                state: 4
            output-fleet-server-default:
                message: 'Error - could not start the HTTP server for the API: failed to listen on the named pipe \\.\pipe\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock: open \\.\pipe\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock: Access is denied.'
                state: 4
        version_info:
            build_hash: "11861004"
            meta:
                build_time: 2024-04-18 09:05:58 +0000 UTC
                commit: "11861004"
            name: fleet-server

fleet_message: |+
    fail to checkin to fleet-server: all hosts failed: 1 error occurred:
    	* requester 0/1 to host https://localhost:8221/ errored: Post "https://localhost:8221/api/fleet/agents/f9489d84-c941-40ef-84eb-e07adcf4b37c/checkin?": dial tcp 127.0.0.1:8221: connectex: No connection could be made because the target machine actively refused it.

fleet_state: 4
log_level: debug
message: 1 or more components/units in a failed state
state: 3
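
For readers decoding the numeric state fields in the dump above, here is a minimal sketch assuming the state enum used by the elastic-agent control protocol (STARTING=0 through STOPPED=6; an assumption, not verified against this exact build). Under that reading the fleet-server component itself is 2 (HEALTHY), its input/output units are 4 (FAILED), and the agent overall is 3 (DEGRADED), which matches the "1 or more components/units in a failed state" message.

package main

import "fmt"

// State mirrors the numeric states in the diagnostics dump, assuming the
// elastic-agent control-protocol enum (an assumption, not confirmed here).
type State int

const (
	Starting    State = iota // 0
	Configuring              // 1
	Healthy                  // 2
	Degraded                 // 3
	Failed                   // 4
	Stopping                 // 5
	Stopped                  // 6
)

func (s State) String() string {
	return [...]string{"STARTING", "CONFIGURING", "HEALTHY", "DEGRADED", "FAILED", "STOPPING", "STOPPED"}[s]
}

func main() {
	fmt.Println(State(2)) // component: HEALTHY ("communicating with pid '6060'")
	fmt.Println(State(4)) // units: FAILED (named-pipe listen error)
	fmt.Println(State(3)) // agent: DEGRADED ("1 or more components/units in a failed state")
}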


cmacknz commented Apr 23, 2024

I see logs like this frequently repeating:

{"log.level":"info","@timestamp":"2024-04-23T04:47:39.249Z","message":"Error - could not start the HTTP server for the API: failed to listen on the named pipe \\\\.\\pipe\\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock: open \\\\.\\pipe\\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock: Access is denied.","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"service.name":"fleet-server","service.type":"fleet-server","state":"FAILED","ecs.version":"1.6.0","ecs.version":"1.6.0"}

michel-laterman commented:

This error comes from fleet-server trying to start the local metrics server, specifically in github.com/elastic/elastic-agent-libs/api; see https://github.com/elastic/elastic-agent-libs/blob/main/api/routes.go#L39
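
For context, here is a minimal sketch of the failing step: binding a local HTTP endpoint to a Windows named pipe, roughly as the elastic-agent-libs api package does for the monitoring server. The pipe name and security descriptor below are illustrative assumptions, not the library's actual values; "Access is denied" at ListenPipe typically means the process lacks rights under the pipe's security descriptor, or a pipe instance with stricter permissions already exists.

//go:build windows

package main

import (
	"fmt"
	"net/http"

	winio "github.com/Microsoft/go-winio"
)

func main() {
	// Illustrative SDDL: full access for SYSTEM and Administrators only.
	// A process running without those rights gets "Access is denied" when
	// the pipe (or a leftover instance of it) is protected like this.
	cfg := &winio.PipeConfig{SecurityDescriptor: "D:P(A;;GA;;;SY)(A;;GA;;;BA)"}

	// Hypothetical pipe name; the real one is generated by the agent.
	ln, err := winio.ListenPipe(`\\.\pipe\example-metrics.sock`, cfg)
	if err != nil {
		// This is where the issue's error surfaces:
		// "failed to listen on the named pipe ...: Access is denied."
		fmt.Println("could not start the HTTP server for the API:", err)
		return
	}
	defer ln.Close()

	mux := http.NewServeMux()
	mux.HandleFunc("/stats", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, `{"ok":true}`) // stand-in for the metrics payload
	})
	_ = http.Serve(ln, mux)
}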

michel-laterman commented:

The changes in elastic-agent-api are:

michel-laterman commented:

Is this reproducible on any other OS, or is it just on Windows?

amolnater-qasource commented:

Hi @michel-laterman

Thank you for looking into this issue.

We have revalidated this issue for a Linux fleet server on an 8.14.0 BC1 Kibana cloud environment and made the following observations:

Observations:

  • The Linux fleet server goes offline for some time after the logging level is set to debug.
  • However, it returns to Healthy, and its memory consumption does not increase the way the Windows fleet-server's does.

Logs for Linux fleet-server:
elastic-agent-diagnostics-2024-04-24T06-12-49Z-00 (1).zip

Build details:
VERSION: 8.14.0 BC1
BUILD: 73520
COMMIT: c1513cd7e5a00eab209ba02d30cafd6945d75470

Screenshot:
image

Please let us know if anything else is required from our end.
Thanks!

michel-laterman commented:

From what I can see, this could have been caused by the policy output reload work we tried to add; the PRs have been reverted in 8.15 and 8.14 as of this morning.

kpollich commented:

Thanks, Michel. @amolnater-qasource, can we retest when the next BC is available? There should be one built tomorrow, April 25.

ycombinator added the Team:Elastic-Agent-Control-Plane label on Apr 30, 2024
ycombinator commented:

Hi @amolnater-qasource, did you get a chance to retest this one? Thanks!

amolnater-qasource commented:

Hi Team,

We have revalidated this issue on the latest 8.14.0 BC4 Kibana self-managed environment and found it fixed:

Observations:

  • Fleet Server remains Healthy after changing the logging level to debug.

Logs:
elastic-agent-diagnostics-2024-05-13T06-30-55Z-00.zip

Screenshot:
image

Build details:
VERSION: 8.14.0 BC4
BUILD: 73836
COMMIT: 23ed1207772b3ae958cb05bc4cdbe39b83507707

Hence, we are closing this issue and marking it as QA:Validated.

Thanks!

amolnater-qasource added the QA:Ready For Testing and QA:Validated labels and removed the QA:Ready For Testing label on May 13, 2024