API ConficSync - Packages - Out of Sync/Missing #9721

Closed
Wintermute2k6 opened this issue Mar 9, 2023 · 23 comments · Fixed by #9980 or #10013
Labels
area/distributed Distributed monitoring (master, satellites, clients) area/runtime Downtimes, comments, dependencies, events bug Something isn't working ref/NC

Comments

@Wintermute2k6

Wintermute2k6 commented Mar 9, 2023

Describe the bug

Due to heavy use of the API to create hosts/services with relatively long names (<128 characters), there seems to be an inconsistency in the API sync behaviour between the masters and the satellites.
While everything seems fine on the HA masters, during a fresh Config Sync/Deployment the Satellites seem to receive an inconsistent config: one satellite is missing some parts, and its partner is missing other parts.
As a result, neither satellite accepts the newly provided config, because each one is missing the parts that its partner received, and vice versa.

This shows up in the Icinga check, which reports:

Last zone sync stage validation failed at 2023-xx-xx 07:52:02 +0100

The startup.log shows which hosts/services are missing; the list varies between the two satellites because each one is missing a different half.
Also, the synced zone is never identical on the two satellite partners, although it should be.

Satellite1 #> ls /var/lib/icinga2/api/packages/_api/*/conf.d/hosts | wc -l
1106
Satellite2 #> ls /var/lib/icinga2/api/packages/_api/*/conf.d/hosts | wc -l
1152
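The two counts above only show that the directories differ, not which files differ. A minimal sketch for listing the differences, assuming the `/var/lib/icinga2/api` directory of each satellite has been copied to a machine where both copies are readable (the function names `host_files` and `compare` are made up for this illustration):

```python
from __future__ import annotations

from pathlib import Path


def host_files(api_dir: str) -> set[str]:
    """Collect host file names below every stage of the _api package."""
    root = Path(api_dir) / "packages" / "_api"
    # Mirrors the glob used in the shell commands: _api/*/conf.d/hosts
    return {p.name for p in root.glob("*/conf.d/hosts/*.conf")}


def compare(api_dir_a: str, api_dir_b: str) -> tuple[set[str], set[str]]:
    """Return (files only in A, files only in B)."""
    a, b = host_files(api_dir_a), host_files(api_dir_b)
    return a - b, b - a
```

Calling `compare("sat1/api", "sat2/api")` on two such copies returns the host files that exist on only one side, which is easier to compare against the startup.log errors than a bare count.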

The Startup.log shows the missing files .. which have (maybe) a problematic naming scheme ?

[2023-03-09 12:28:36 +0100] information/ConfigItem: Committing config item(s).
[2023-03-09 12:28:36 +0100] warning/CheckerComponent: Attribute 'concurrent_checks' for object 'checker' of type 'CheckerComponent' is deprecated and should not be used.
[2023-03-09 12:28:36 +0100] information/ApiListener: My API identity: Satellite1
[2023-03-09 12:28:37 +0100] critical/config: Error: Validation failed for object 'test:dev:test01:aa-rest-functiontest-secure:healthcheck!aa-rest-functiontest-secure:aa-rest-functiontest-secure:test01:dev:secure' of type 'Service'; Attribute 'host_name': Object 'test:dev:test01:aa-rest-functiontest-secure:healthcheck' of type 'Host' does not exist.
Location: in /var/lib/icinga2/api/packages/_api/3cf45951-0f22-495b-855c-59856870f1eb/conf.d/services/test%3Adev%3Atest01%3Aaa-rest-functiontest-secure%3Ahealthcheck%21aa-rest-functiontest-secure%3Aaa-rest-functiontest-secure%3Atest01%3Adev%3Asecure.conf: 4:2-4:69
/var/lib/icinga2/api/packages/_api/3cf45951-0f22-495b-855c-59856870f1eb/conf.d/services/test%3Adev%3Atest01%3Aaa-rest-functiontest-secure%3Ahealthcheck%21aa-rest-functiontest-secure%3Aaa-rest-functiontest-secure%3Atest01%3Adev%3Asecure.conf(2):
...

The problematic host/service names follow this scheme:

test:dev:test01:aa-rest-functiontest-secure:healthcheck!aa-rest-functiontest-secure:aa-rest-functiontest-secure:test01:dev:secure

URL Encoded:

test%3Adev%3Atest01%3Aaa-rest-functiontest-secure%3Ahealthcheck%21aa-rest-functiontest-secure%3Aaa-rest-functiontest-secure%3Atest01%3Adev%3Asecure
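For reference, the relation between the object name and the file name seen in the startup.log can be reproduced with Python's standard percent-encoding. This is only an illustration: `quote()` happens to yield the same string for this name, but it is not Icinga's own escaping routine and may differ for other characters. The helper names are hypothetical.

```python
from __future__ import annotations

from urllib.parse import quote

FULL_NAME = (
    "test:dev:test01:aa-rest-functiontest-secure:healthcheck"
    "!aa-rest-functiontest-secure:aa-rest-functiontest-secure:test01:dev:secure"
)


def config_file_name(object_name: str) -> str:
    """Percent-encode an object name and append the .conf suffix."""
    # safe="" makes quote() escape every reserved character, including ':' and '!'
    return quote(object_name, safe="") + ".conf"


def split_service_name(full_name: str) -> tuple[str, str]:
    """Split 'host!service' at the first '!' into (host, service)."""
    host, _, service = full_name.partition("!")
    return host, service
```

`split_service_name(FULL_NAME)[0]` gives the host name that the validation error reports as missing, and `config_file_name(FULL_NAME)` gives the service file name from the `Location:` line.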

Also, the issue seems to have been present only since the update from version 2.11.x to 2.12.x

Expected behavior

Config sync should work and sync properly between the config master and the satellite partners, without either satellite losing (half of) the config to its partner node, and it should also sync the packages folder properly.

Your Environment

Include as many relevant details about the environment you experienced the problem in

  • Version used (icinga2 --version): icinga2 - The Icinga 2 network monitoring daemon (version: r2.12.9-1)
  • Operating System and version:
    OS name | Red Hat Enterprise Linux Server
    OS Version | 7.9 (Maipo)
  • Icinga Web 2 version and modules (System - About):
  • Config validation (icinga2 daemon -C):
[2023-03-01 15:15:15 +0100] information/cli: Icinga application loader (version: r2.12.9-1)
[2023-03-01 15:15:15 +0100] information/cli: Loading configuration file(s).
[2023-03-01 15:15:19 +0100] information/ConfigItem: Committing config item(s).
[2023-03-01 15:15:20 +0100] information/ApiListener: My API identity: Master01
[2023-03-01 15:15:29 +0100] information/WorkQueue: #4 (DaemonUtility::LoadConfigFiles) items: 0, rate: 80.2667/s (4816/min 4816/5min 4816/15min);
[2023-03-01 15:15:30 +0100] information/WorkQueue: #6 (InfluxdbWriter, influxdbp01) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2023-03-01 15:15:30 +0100] information/WorkQueue: #8 (ApiListener, RelayQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2023-03-01 15:15:30 +0100] information/WorkQueue: #9 (ApiListener, SyncQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2023-03-01 15:15:30 +0100] information/WorkQueue: #7 (InfluxdbWriter, influxdbp02) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2023-03-01 15:15:39 +0100] information/WorkQueue: #4 (DaemonUtility::LoadConfigFiles) items: 56, rate: 86.8/s (5208/min 5208/5min 5208/15min);
[2023-03-01 15:16:06 +0100] information/ConfigItem: Instantiated 1 NotificationComponent.
[2023-03-01 15:16:06 +0100] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2023-03-01 15:16:06 +0100] information/ConfigItem: Instantiated 1 CheckerComponent.
[2023-03-01 15:16:06 +0100] information/ConfigItem: Instantiated 2 Users.
[2023-03-01 15:16:06 +0100] information/ConfigItem: Instantiated 68 ScheduledDowntimes.
[2023-03-01 15:16:06 +0100] information/ConfigItem: Instantiated 7 TimePeriods.
[2023-03-01 15:16:06 +0100] information/ConfigItem: Instantiated 6323 Zones.
[2023-03-01 15:16:06 +0100] information/ConfigItem: Instantiated 4 ServiceGroups.
[2023-03-01 15:16:06 +0100] information/ConfigItem: Instantiated 164214 Services.
[2023-03-01 15:16:06 +0100] information/ConfigItem: Instantiated 1 IcingaApplication.
[2023-03-01 15:16:06 +0100] information/ConfigItem: Instantiated 27877 Hosts.
[2023-03-01 15:16:06 +0100] information/ConfigItem: Instantiated 2 EventCommands.
[2023-03-01 15:16:06 +0100] information/ConfigItem: Instantiated 2 NotificationCommands.
[2023-03-01 15:16:06 +0100] information/ConfigItem: Instantiated 220503 Notifications.
[2023-03-01 15:16:06 +0100] information/ConfigItem: Instantiated 345 HostGroups.
[2023-03-01 15:16:06 +0100] information/ConfigItem: Instantiated 6331 Endpoints.
[2023-03-01 15:16:06 +0100] information/ConfigItem: Instantiated 924 Downtimes.
[2023-03-01 15:16:06 +0100] information/ConfigItem: Instantiated 125 Comments.
[2023-03-01 15:16:06 +0100] information/ConfigItem: Instantiated 1 FileLogger.
[2023-03-01 15:16:06 +0100] information/ConfigItem: Instantiated 11 ApiUsers.
[2023-03-01 15:16:06 +0100] information/ConfigItem: Instantiated 406 CheckCommands.
[2023-03-01 15:16:06 +0100] information/ConfigItem: Instantiated 2 InfluxdbWriters.
[2023-03-01 15:16:06 +0100] information/ConfigItem: Instantiated 1 ApiListener.
[2023-03-01 15:16:07 +0100] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2023-03-01 15:16:07 +0100] information/cli: Finished validating the configuration file(s).

Additional context

ref/NC/777526

@julianbrost
Contributor

during a fresh Config Sync/Deployment the Satellites

What exactly does that mean? Setting up a new satellite and this happens the first time that satellite receives any configuration?

In general, logs from the time the host should have been synced would be interesting. When this was probably depends on the previous question, could be the previous connection to parent/same zone nodes or the time this host was created via the API.

The Startup.log shows the missing files .. which have (maybe) a problematic naming scheme ?

Is there a particular reason to believe that it's related to the names? Like are there also objects with more regular names and this only happens for the ones with these exotic names?

I don't want to fully rule it out now, but if it worked to create the file for the service, that should also have worked for the host as the service file name includes the host name and is even longer.

@K0nne
Contributor

K0nne commented Mar 9, 2023

during a fresh Config Sync/Deployment the Satellites
What exactly does that mean?

Because of missing host-objects in the /var/lib/icinga2/api/packages-folder we stopped Icinga on the satellites, deleted the folder /var/lib/icinga2/api, as well as the state-file and started the icinga-process again.

In the first minute after the initial rebuild of the api folder, the config check on the satellites was OK. A few minutes later, the number of hosts in the packages folder was still growing, but when it stopped it had never reached the number of hosts in the master zone's api packages folder that should be synced to the specific zone. There were always fewer objects, and after another config check (~5 min later) the satellites' config was broken again because of missing host objects.

Is there a particular reason to believe that it's related to the names? Like are there also objects with more regular names and this only happens for the ones with these exotic names?

Exactly. In every occurrence of the problem only objects of this specific naming scheme were mentioned by the config check.

Sidenote:
We initially found this issue because the packages folder on the satellites contained hundreds of old API-created hosts which had already been deleted using the API. It seems the delete command was only successful within the master zone and never reached the satellite zones. Multiple zones were affected. In every case these were hosts with this specific naming scheme. The checks of those 'undead' hosts were still executed 'under the hood'. We could not find them using the API, icinga2 object list or the icingaweb2 UI. We only found them by accident, and seeing this number of objects that we were unaware of was kind of shocking. We fixed it with the help of the support by purging the api folder of all satellites of a zone at the same time, but seeing the ongoing sync problem now, it all seems to be part of a bigger problem. And the specific naming scheme is the best clue we have at this point.

We will send you the logs.

@K0nne
Contributor

K0nne commented Mar 10, 2023

I have uploaded the logfile.

@julianbrost
Contributor

julianbrost commented Mar 13, 2023

In the logs, there are (as you pointed out already) errors showing that the affected objects were not created because they import templates that don't exist yet; these templates are only synced at a later point.

I suspect this might have been introduced by #7936 (2.12.0, backported to 2.11.5 in #8093): that PR changed file-based config updates to be handled in the background, which can result in an object-based update being applied in a different order starting from these versions.

Also, the issue seems to have been present only since the update from version 2.11.x to 2.12.x

Was the version you upgraded from older than 2.11.5? That would be consistent with my theory then.

Is there a particular reason to believe that it's related to the names? Like are there also objects with more regular names and this only happens for the ones with these exotic names?

Exactly. In every occurrence of the problem only objects of this specific naming scheme were mentioned by the config check.

I think the names are just a coincidence here. Maybe these are the only objects using templates.

@julianbrost
Contributor

Looks like that problem already showed up while testing in #7742 (comment), but was attributed to a version mismatch by mistake (#7742 (comment) (1.)).

@julianbrost julianbrost added bug Something isn't working area/configuration DSL, parser, compiler, error handling area/distributed Distributed monitoring (master, satellites, clients) labels Mar 13, 2023
@K0nne
Contributor

K0nne commented Mar 13, 2023

We have upgraded from 2.11.11 to 2.12.9.
Those objects are indeed the only API-created ones that use templates.

@julianbrost
Contributor

We have upgraded from 2.11.11 to 2.12.9.

How confident are you that this bug is new between these versions, i.e. did you do the same on 2.11.11 and it worked? Note that to trigger what I think is the bug here, an API-created object must reference something that comes from the file-based sync (/etc/icinga2/zones.d or Director or the config packages API), that file must be synced on the same connection, and it must not have existed on the target node before (otherwise the reference would work, potentially referring to an outdated version of that file).
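The trigger conditions described here can be pictured with a toy model. This is not Icinga code, only a sketch of the ordering argument under simplified assumptions; `apply_updates` and the tuple layout are invented for the illustration.

```python
def apply_updates(updates, existing_templates=()):
    """Apply updates in order and return the names of rejected objects.

    Each update is either ("file", template_name) for a file-based sync that
    provides a template, or ("object", object_name, template_name) for an
    API-created object that imports that template.
    """
    templates = set(existing_templates)
    rejected = []
    for update in updates:
        if update[0] == "file":
            # The file-based sync makes the template available from now on.
            templates.add(update[1])
        else:
            _, name, template = update
            # An object importing a template that is not loaded yet is rejected.
            if template not in templates:
                rejected.append(name)
    return rejected
```

In this model, `apply_updates([("file", "tpl"), ("object", "svc", "tpl")])` rejects nothing, while the reversed order rejects `svc`. Passing `existing_templates={"tpl"}` with the reversed order also rejects nothing, which corresponds to the last condition: the file must not already exist on the target node.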

@K0nne
Contributor

K0nne commented Mar 13, 2023

I have been running this scenario since the beginning of November 2022 and it worked flawlessly (in terms of a valid config being built every hour). The problem emerged after our upgrade to 2.12.9, around the time we found the outdated config on our satellites. To fix that we deleted the /var/lib/icinga2/api folder. After this the problem occurred.

It is possible that this delete operation (the first one since November 2022, afaik) triggered the bug for the first time and that the timing is just a coincidence.

@julianbrost
Contributor

Looks like what's going on is a bit more complicated than what I originally imagined. If just some objects were missing because they reference templates that come from not yet synced and loaded files, that should fix itself with the next reload (as it should also be triggered after the files were synced) as the objects are sent again. However, due to #7936 moving the file sync to the background, it may actually see intermediate files from the API/object-based config sync that cause it to fail and the aforementioned reload never happens.

I'd like to confirm that theory with your logs but the debug.log and startup.log are from different times. Can you please also upload logs from around 2023-03-09 12:28 corresponding to the startup.log you already uploaded. Doesn't matter if these aren't debug logs, the normal icinga2.log also should contain what I'm looking for.

@aheinhold

Hi @julianbrost, I have uploaded the requested logfile to the Netways Nextcloud.

@julianbrost
Contributor

That log doesn't seem to confirm my theory unfortunately. I'd only have expected config validation errors ("Config validation failed for staged cluster config sync") close to object sync errors ("Could not create object"), but the latter only seem to happen around 10:01, while there are multiple config validation errors starting at 10:28. But we're currently trying to replicate the setup, so maybe we will just see the same behavior there.
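A rough sketch of how that timing comparison could be automated, assuming log lines in the `[timestamp] severity/Component: message` layout shown in the startup.log excerpt earlier in this thread. The two message fragments are the ones quoted in this comment; the function names and the 300-second window are arbitrary choices for the illustration.

```python
import re
from datetime import datetime

LINE_RE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4})\] (\S+): (.*)$"
)

MARKERS = {
    "validation": "Config validation failed for staged cluster config sync",
    "object": "Could not create object",
}


def find_events(lines):
    """Yield (timestamp, kind) for every log line containing a marker."""
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        for kind, needle in MARKERS.items():
            if needle in match.group(3):
                stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S %z")
                yield stamp, kind


def unexplained_validation_failures(events, window_seconds=300):
    """Return validation failures with no object error within the window."""
    events = sorted(events)
    object_times = [t for t, kind in events if kind == "object"]
    result = []
    for t, kind in events:
        if kind != "validation":
            continue
        # "Close" is taken to mean within window_seconds in either direction.
        if not any(abs((t - o).total_seconds()) <= window_seconds for o in object_times):
            result.append(t)
    return result
```

Feeding the lines of icinga2.log through `find_events` and then `unexplained_validation_failures` would list the validation failures that have no nearby object sync error, such as the ones starting at 10:28.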

@K0nne
Contributor

K0nne commented Mar 14, 2023

In the master-zone the config is always valid.
If you have any questions, we are happy to help.

@julianbrost
Contributor

Question about the hosts that show up in error messages like that one (also the ones from the startup.log you uploaded to Nextcloud which were different names):

Attribute 'host_name': Object 'test:dev:test01:aa-rest-functiontest-secure:healthcheck' of type 'Host' does not exist.

Were these objects created using the /v1/objects API? Or are these from a file-based config, like /etc/icinga2 or the /v1/config API? At the time of the error, were these hosts supposed to exist? Also, were there any modifications (including create/delete) made to these hosts?

@K0nne
Contributor

K0nne commented Mar 16, 2023

Those objects were created with the /v1/objects API. The objects exist in the master zone at the time of the error, and the config there is valid.

Those hosts are placeholders for the underlying services, which are more volatile. If there's a referencing service, its host should always exist. Those hosts are not modified in any way. They are automatically deleted when their last service is deleted.

We temporarily mitigated the problem by copying the missing hosts from the master zone to the satellites. After this the satellites' config is valid.
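A sketch for finding which hosts would need to be copied: it scans one stage directory of the `_api` package for service files whose host file is absent. It assumes that host files are named like the service files in the startup.log, i.e. `conf.d/hosts/<percent-encoded host name>.conf`; that layout is inferred from the paths in this thread, and the function name is hypothetical.

```python
from __future__ import annotations

from pathlib import Path
from urllib.parse import quote, unquote


def services_without_host(stage_dir: str) -> list[str]:
    """Return full names (host!service) of services whose host file is missing."""
    conf_d = Path(stage_dir) / "conf.d"
    missing = []
    for service_file in sorted((conf_d / "services").glob("*.conf")):
        # The file name is the percent-encoded "host!service" name.
        full_name = unquote(service_file.stem)
        host_name, sep, _ = full_name.partition("!")
        if not sep:
            continue  # not a host!service name, skip it
        host_file = conf_d / "hosts" / (quote(host_name, safe="") + ".conf")
        if not host_file.exists():
            missing.append(full_name)
    return missing
```

Run against a stage directory such as `/var/lib/icinga2/api/packages/_api/<stage>` on a satellite, this lists the services that would fail validation with "Object ... of type 'Host' does not exist".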

If there's the need, you can have a look at the system.

@K0nne
Contributor

K0nne commented Mar 27, 2023

This week we plan to trigger the bug again by deleting the API folder in another zone. That zone also has surviving API objects that were already deleted in the master zone and that we need to purge.

@julianbrost
Contributor

Can you please create backups of /var/lib/icinga2 and /etc/icinga2 before and after each operation you perform so that we can take a look at what happened with specific objects over time afterwards?

@julianbrost
Contributor

Please do so on both masters and satellites, even if you're just performing some operation on one of them.

@K0nne
Contributor

K0nne commented Mar 31, 2023

I have uploaded the results.

@K0nne
Contributor

K0nne commented Jan 17, 2024

Hello,
are there any updates on this? Yesterday the problem re-emerged. Our infrastructure uses 2.13.9.

@K0nne
Contributor

K0nne commented Jan 17, 2024

One of our satellites suddenly had an invalid config with missing components and was unable to sync into a valid stage again afterwards. This was satellite2 of a zone. satellite1 had a valid config. On satellite2 we stopped Icinga, removed /var/lib/icinga2/api/ + the state file and started Icinga again. For 1-2 min its config was valid, before the sync entered an invalid state again.

At this point we made a backup of the api dir of each satellite of the zone. We removed /var/lib/icinga2/api/ + the state file on both satellites and restarted Icinga on both machines. After this both satellites showed the same broken sync behaviour, and satellite1 was now missing the same objects as satellite2. In this case api-created services were missing referencing templates from global-templates.

We restored the valid config backup of satellite1; its config check was OK and we got Icinga running again. On satellite2 we found that /var/lib/icinga2/api/zones/global-templates/ was empty. We then copied the directory /var/lib/icinga2/api/zones/global-templates from satellite1 to satellite2. The config check on satellite2 now showed missing API-created hosts. We then copied all hosts from the api dir /var/lib/icinga2/api/packages/_api//conf.d/hosts/ of satellite1 to satellite2. Since then the config has been valid again.

@yhabteab
Member

Hi @K0nne, sorry for the delay! We were able to reproduce your issue of some objects created via the API magically disappearing on the satellite endpoints, and we're working on it!

Thank you for your exhaustive contributions!

@K0nne
Contributor

K0nne commented Jan 24, 2024

We are happy to hear this! This issue has been haunting us on and off for years now.

@yhabteab yhabteab self-assigned this Jan 25, 2024
@yhabteab yhabteab added area/runtime Downtimes, comments, dependencies, events and removed area/configuration DSL, parser, compiler, error handling labels Jan 26, 2024
@yhabteab yhabteab added this to the 2.15.0 milestone Jan 26, 2024
@julianbrost
Contributor

Unfortunately, #9980 had to be reverted as it caused other problems as described in #10012.

Reopening this issue until it's properly fixed in #10013.

5 participants