Config Sync - failed reloads due to uncomplete syncs #7742

Closed
Clasko opened this issue Jan 8, 2020 · 25 comments · Fixed by #7936
Labels: area/distributed (Distributed monitoring: master, satellites, clients), bug, ref/NC

Comments

@Clasko

Clasko commented Jan 8, 2020

Describe the bug

We're facing an issue with failed config reloads due to incomplete syncs of our global-templates zone.
But this only happens in a zone with a second level of satellite hierarchy (Master -> Satellite -> Satellite). Satellites without children are not affected.
We can only fix this by purging /var/lib/icinga2/api/zones and /var/lib/icinga2/api/zones-stage on the satellites.
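
The purge workaround can be sketched as a small script. This is a minimal sketch, assuming Icinga 2 is stopped on the satellite before it runs and restarted afterwards, and that the default data directory from the environment output is in use. The function name `purge_synced_config` is hypothetical and not part of Icinga 2.

```python
import shutil
from pathlib import Path


def purge_synced_config(api_dir="/var/lib/icinga2/api"):
    """Empty the synced zone directories so the next sync starts from a clean state.

    Assumption: Icinga 2 is not running while this is executed.
    Returns the list of directories that were emptied.
    """
    purged = []
    for name in ("zones", "zones-stage"):
        target = Path(api_dir) / name
        if not target.is_dir():
            continue  # nothing synced yet, nothing to purge
        for child in target.iterdir():
            if child.is_dir() and not child.is_symlink():
                shutil.rmtree(child)  # remove a zone directory with all its files
            else:
                child.unlink()  # remove a plain file or symlink
        purged.append(str(target))
    return purged
```

The two top-level directories themselves are kept, so ownership and permissions stay as the package installed them.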

This only happens on object creation or deletion. Changes to already existing objects do not trigger this issue.

Error Message Example:

Error: Function call 'opendir' for file '/var/lib/icinga2/api/zones-stage//global-templates/_etc/credentials' failed with error code 2, 'No such file or directory'

To Reproduce

  1. Create or delete a Host object in the affected zone
  2. Trigger a Config reload

Expected behavior

Working Config sync and reload on all Icinga nodes

Your Environment

Include as many relevant details about the environment you experienced the problem in

  • Version used (icinga2 --version):
icinga2 - The Icinga 2 network monitoring daemon (version: 2.11.2-1)

Copyright (c) 2012-2020 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
  Platform: Red Hat Enterprise Linux Server
  Platform version: 7.7 (Maipo)
  Kernel: Linux
  Kernel version: 3.10.0-1062.1.1.el7.x86_64
  Architecture: x86_64

Build information:
  Compiler: GNU 4.8.5
  Build host: runner-LTrJQZ9N-project-322-concurrent-0

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid
  • Operating System and version: RHEL 7.7
  • Enabled features (icinga2 feature list):
Disabled features: compatlog debuglog elasticsearch gelf graphite influxdb livestatus notification opentsdb perfdata statusdata syslog
Enabled features: api checker command mainlog
  • Config validation (icinga2 daemon -C):
[2020-01-08 11:47:29 +0100] information/cli: Icinga application loader (version: 2.11.2-1)
[2020-01-08 11:47:29 +0100] information/cli: Loading configuration file(s).
[2020-01-08 11:47:29 +0100] information/ConfigItem: Committing config item(s).
[2020-01-08 11:47:29 +0100] information/ApiListener: My API identity: sattelite.domain.com
... (only some apply rules without matches on this satellite)
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 1 FileLogger.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 743 Dependencies.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 8 NotificationCommands.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 2722 Notifications.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 1 IcingaApplication.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 173 HostGroups.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 213 Hosts.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 32 Downtimes.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 5 Comments.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 1 CheckerComponent.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 4 Zones.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 6 Endpoints.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 1 ExternalCommandListener.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 6 UserGroups.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 1 ApiListener.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 307 CheckCommands.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 8 TimePeriods.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 11 Users.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 1017 Services.
[2020-01-08 11:47:31 +0100] information/ConfigItem: Instantiated 35 ServiceGroups.
[2020-01-08 11:47:32 +0100] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2020-01-08 11:47:32 +0100] information/cli: Finished validating the configuration file(s).
  • If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes.

Additional context

N/A

@Clasko
Author

Clasko commented Jan 8, 2020

@dnsmichi
Contributor

dnsmichi commented Jan 8, 2020

Please also share the output of ls -lahR /var/lib/icinga2/api/zones-stage/ of the affected satellite host.

dnsmichi added the area/distributed and needs feedback labels Jan 8, 2020
@Clasko
Author

Clasko commented Jan 8, 2020

zone-stage_output.txt

The attached file is from one of our "middle" satellites (from the "aws-frankfurt-satellite" zone). I hope this is enough and helps. Let me know if you need the output from all 4 affected satellites. (Anonymizing is always a little bit difficult.)

@cite

cite commented Jan 19, 2020

We were facing the same issue: https://community.icinga.com/t/global-configuration-zone-missing-check-commands/2976/6

Should you need more debugging data, we would be happy to switch our config sync back to Icinga2 and send you logfiles.

Al2Klimov removed the needs feedback label Mar 5, 2020
Al2Klimov self-assigned this Mar 16, 2020
@Al2Klimov
Member

Note: The example error message implies that the file system tree changed unexpectedly, but we definitely lock /var/lib/icinga2/api/zones-stage exclusively, so only one writer changes it at a time.

@Al2Klimov
Member

Hello @Clasko and thank you for reporting!

Independent of this issue, you should upgrade to v2.11.3 to avoid a lot of other trouble.

Best,
AK

@Al2Klimov
Member

@Clasko Please could you test v2.11.3 + #7917: https://git.icinga.com/packaging/rpm-icinga2/-/jobs/45459 / "Job artifacts" / "Download"

Al2Klimov added the needs feedback label Mar 17, 2020
@Clasko
Author

Clasko commented Mar 17, 2020

I'm on vacation for the next 2 weeks. I will see if a colleague can do the testing.

@Al2Klimov
Member

If they can't and the artifacts disappear, just let me know when you are going to run the tests and I'll re-create the artifacts.

@Al2Klimov
Member

Did you upgrade all of the nodes to the same version? If not, please share the Icinga 2 versions of all nodes in both the zone of the affected node and all parent zones.

Also please share the output of find /var/lib/icinga2/api/zones* -name .authoritative.
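
For collecting this from several nodes by script, the same lookup can be sketched in Python. This is only an illustration of what the `find` command above searches for; the function name `find_authoritative_markers` is hypothetical, and the default path is the data directory from the environment output.

```python
from pathlib import Path


def find_authoritative_markers(api_dir="/var/lib/icinga2/api"):
    """Return all .authoritative marker files below <api_dir>/zones*."""
    markers = []
    # Mirrors the shell glob zones*: matches both zones and zones-stage.
    for top in sorted(Path(api_dir).glob("zones*")):
        if top.is_dir():
            markers.extend(sorted(str(p) for p in top.rglob(".authoritative")))
    return markers


if __name__ == "__main__":
    for marker in find_authoritative_markers():
        print(marker)
```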

Al2Klimov added the needs feedback label Apr 2, 2020
@Al2Klimov
Member

Also: Which zones do you have config for and in which dir on the affected node?

ls /etc/icinga2/zones.d /var/lib/icinga2/api/zones*

@Clasko
Author

Clasko commented Apr 2, 2020

I've upgraded all affected nodes (from my point of view).

Which means:

Master: 2.11.3-1
Affected satellites under the master: 2.12.0-rc1-3-g2eff305
Second-level satellites (children of the satellites above): 2.12.0-rc1-3-g2eff305

The issue currently occurs only on these 4 satellites (2 HA zones). We have other satellites as children of our master which are not affected by this issue. These satellites are on version 2.11.2-1.

Output of the find command on our config master. The output on the affected satellites is empty:

/var/lib/icinga2/api/zones/global-templates/.authoritative
/var/lib/icinga2/api/zones/aws-frankfurt-satellite/.authoritative (affected zone)
/var/lib/icinga2/api/zones/customer1-satellite-nes/.authoritative
/var/lib/icinga2/api/zones/customer2-satellite-kar/.authoritative
/var/lib/icinga2/api/zones/config-ha-master/.authoritative
/var/lib/icinga2/api/zones/customer3-satellite/.authoritative (affected zone)
/var/lib/icinga2/api/zones/customer4-satellite/.authoritative
/var/lib/icinga2/api/zones/customer5-satellite/.authoritative

I cannot share the un-anonymized output on GitHub, as our zone names contain customer names.
We place the configs directly in /etc/icinga2/zones.d/customer3-satellite. "customer3-satellite" is a child of "aws-frankfurt-satellite" (which is a child of "config-ha-master").

Sorry if my answers are a bit confusing due to the anonymized outputs. I can provide raw output if you can provide me a Nextcloud file drop link, or via NETWAYS ticket #663455 if this helps.

@Al2Klimov
Member

  1. Higher nodes (hierarchically) should not run an older version than lower ones. Could you upgrade the master(s) to the same version as the affected satellites?
  2. Fine. One point of failure fewer.
  3. OK, let's simplify the question: Are there any dirs on the affected node in /etc/icinga2/zones.d?
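
The version rule from point 1 can be checked mechanically. The following is a minimal sketch with hypothetical helper names (`release_tuple`, `outdated_parents`). It is a simplification: it compares only the numeric major.minor.patch part, so a release candidate such as 2.12.0-rc1 is treated like 2.12.0.

```python
import re


def release_tuple(version):
    """Extract (major, minor, patch) from strings like '2.11.3-1' or 'v2.11.3'."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def outdated_parents(hierarchy):
    """Return (parent, child) pairs where the parent runs an older release.

    hierarchy: list of (node name, version) ordered from the top (master) down.
    """
    problems = []
    for (parent, parent_version), (child, child_version) in zip(hierarchy, hierarchy[1:]):
        if release_tuple(parent_version) < release_tuple(child_version):
            problems.append((parent, child))
    return problems


# Versions as reported in this thread; the node names are placeholders.
reported = [
    ("master", "2.11.3-1"),
    ("satellite", "2.12.0-rc1-3-g2eff305"),
    ("child-satellite", "2.12.0-rc1-3-g2eff305"),
]
print(outdated_parents(reported))  # [('master', 'satellite')]
```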

@Clasko
Author

Clasko commented Apr 3, 2020

to 1) I know and follow this rule for stable releases, but I'm a bit careful with snapshots or even RPMs built directly from the master branch in a production environment if not absolutely necessary. I will try to reproduce the issue in our test environment, but I had no luck in the past. I will reconsider if my next attempts fail again.
to 3) No, none of our satellites has local configuration in /etc/icinga2/zones.d

@Al2Klimov
Member

3: Fine. One point of failure fewer.

1: Snapshots are RPMs directly from the master branch. But my RPMs are neither of those. I know customers' stability requirements and you can fully trust me: If I say "These packages contain version X + PR Y", the packages won't contain a single line of code more.

@Clasko
Author

Clasko commented Apr 3, 2020

I've upgraded our two masters to 2.12.0-rc1-3-g2eff305 and now I'm unable to reproduce the issue! Looks good so far, thank you! :)

Al2Klimov removed the needs feedback label Apr 3, 2020
Al2Klimov self-assigned this Apr 3, 2020
@Al2Klimov
Member

ref/NC/663455

Al2Klimov added the bug and ref/NC labels Apr 3, 2020
Al2Klimov added a commit that referenced this issue Apr 3, 2020
@Al2Klimov
Member

Note

There have been three problems:

  1. The config sync is incomplete – solved by upgrading the master to a version not lower than the satellite (v2.12-rc1)
  2. Multiple config syncs from different nodes ran concurrently instead of mutually exclusively and corrupted the directory structure – solved by ApiListener::ConfigUpdateHandler(): make the whole process mutually exclusive #7936
  3. startup.log disappears – solved by Place startup.log and status in /var/lib/icinga2/api, not /var/lib/icinga2/api/zones-stage #7961
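
The idea behind problem 2 and its fix can be illustrated with a minimal sketch. This is not Icinga 2's C++ implementation: the class `ConfigUpdateReceiver` is hypothetical, and an in-memory dict stands in for the staging directory. It only shows the technique named in the PR title, making the whole update handler mutually exclusive so that syncs from different senders cannot interleave.

```python
import threading


class ConfigUpdateReceiver:
    """Hypothetical receiver that serializes complete config updates."""

    def __init__(self):
        self._lock = threading.Lock()
        self.stage = {}    # stand-in for the staging directory
        self.applied = []  # (sender, snapshot of the stage) per finished update

    def handle_update(self, sender, files):
        # The lock covers the whole process: clearing the stage, writing
        # the files and taking them over. No other sender can modify the
        # stage in between.
        with self._lock:
            self.stage.clear()
            for path, content in files.items():
                self.stage[path] = content
            self.applied.append((sender, dict(self.stage)))


receiver = ConfigUpdateReceiver()
updates = {
    "satellite-a": {"global-templates/_etc/a.conf": "object A"},
    "satellite-b": {"global-templates/_etc/b.conf": "object B"},
}
threads = [
    threading.Thread(target=receiver.handle_update, args=(sender, files))
    for sender, files in updates.items()
    for _ in range(20)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Every snapshot in `receiver.applied` contains exactly the files of one sender. Without the lock, one sender could clear the stage while another is still writing to it, which corresponds to the missing-directory error from the original report.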
