Problem in the SOFT HARD check logic #368

Open
dirtyren opened this issue Aug 4, 2021 · 2 comments

@dirtyren
Contributor

dirtyren commented Aug 4, 2021

Hello,

I found the problem described below.
The host went down and Naemon set the service to CRITICAL HARD, but when the host came back UP, Naemon set the service to OK SOFT. This broke some availability reports that depend on HARD states for their calculations.
The question is: should the service not be set to OK HARD when the host comes back up?

Thanks.

[Fri Jul 23 03:39:31 2021] INITIAL SERVICE STATE: HOSTDEMO;SVCDEMO;OK;HARD;1;OK
[Fri Jul 23 21:41:11 2021] HOST ALERT: HOSTDEMO;DOWN;SOFT;1;CRITICAL - 192.168.54.32: rta nan, lost 100%
[Fri Jul 23 21:41:21 2021] HOST ALERT: HOSTDEMO;DOWN;SOFT;2;CRITICAL - 192.168.54.32: rta nan, lost 100%
[Fri Jul 23 21:41:37 2021] HOST ALERT: HOSTDEMO;DOWN;HARD;3;CRITICAL - 192.168.54.32: rta nan, lost 100%
[Fri Jul 23 21:42:57 2021] SERVICE INFO: HOSTDEMO;SVCDEMO; Service switch to hard down state due to host down.
[Fri Jul 23 21:42:57 2021] SERVICE ALERT: HOSTDEMO;SVCDEMO;CRITICAL;HARD;1;CRITICAL - cannot connect
[Fri Jul 23 21:46:57 2021] HOST ALERT: HOSTDEMO;UP;HARD;1;OK - 192.168.54.32: , rta 0.259ms, lost 0%
[Fri Jul 23 21:47:17 2021] SERVICE ALERT: HOSTDEMO;SVCDEMO;CRITICAL;SOFT;1;CRITICAL - cannot connect
[Fri Jul 23 21:49:17 2021] SERVICE ALERT: HOSTDEMO;SVCDEMO;CRITICAL;SOFT;2;CRITICAL - cannot connect
[Fri Jul 23 21:51:18 2021] SERVICE ALERT: HOSTDEMO;SVCDEMO;OK;SOFT;3;OK
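
To check whether other services are affected by the same pattern, the log can be scanned for services whose last HARD state is a problem state while their most recent alert is OK SOFT. This is only a minimal sketch: it assumes the semicolon-separated SERVICE ALERT format shown above, and the name `find_soft_recoveries` is made up for illustration and is not part of Naemon.

```python
import re

# Matches lines like:
# [Fri Jul 23 21:51:18 2021] SERVICE ALERT: HOST;SVC;OK;SOFT;3;output
ALERT_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] SERVICE ALERT: "
    r"(?P<host>[^;]*);(?P<svc>[^;]*);(?P<state>[^;]*);"
    r"(?P<type>[^;]*);(?P<attempt>\d+);(?P<output>.*)$"
)


def find_soft_recoveries(lines):
    """Return (host, service) pairs whose last HARD state is not OK
    although the most recent alert reports OK;SOFT.

    Hypothetical helper, not a Naemon API.
    """
    last_hard = {}   # (host, svc) -> last state logged as HARD
    last_alert = {}  # (host, svc) -> (state, state_type) of latest alert
    for line in lines:
        m = ALERT_RE.match(line.strip())
        if not m:
            continue  # skip HOST ALERT, SERVICE INFO and other entries
        key = (m["host"], m["svc"])
        if m["type"] == "HARD":
            last_hard[key] = m["state"]
        last_alert[key] = (m["state"], m["type"])
    return [
        key
        for key, (state, state_type) in last_alert.items()
        if state == "OK" and state_type == "SOFT"
        and last_hard.get(key, "OK") != "OK"
    ]


log = """\
[Fri Jul 23 21:42:57 2021] SERVICE ALERT: HOSTDEMO;SVCDEMO;CRITICAL;HARD;1;CRITICAL - cannot connect
[Fri Jul 23 21:46:57 2021] HOST ALERT: HOSTDEMO;UP;HARD;1;OK - 192.168.54.32: , rta 0.259ms, lost 0%
[Fri Jul 23 21:47:17 2021] SERVICE ALERT: HOSTDEMO;SVCDEMO;CRITICAL;SOFT;1;CRITICAL - cannot connect
[Fri Jul 23 21:49:17 2021] SERVICE ALERT: HOSTDEMO;SVCDEMO;CRITICAL;SOFT;2;CRITICAL - cannot connect
[Fri Jul 23 21:51:18 2021] SERVICE ALERT: HOSTDEMO;SVCDEMO;OK;SOFT;3;OK
""".splitlines()

print(find_soft_recoveries(log))  # [('HOSTDEMO', 'SVCDEMO')]
```

The script only reads SERVICE ALERT entries, so a HARD state that is recorded solely through an INITIAL SERVICE STATE or CURRENT SERVICE STATE line would still be flagged.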

@dirtyren
Contributor Author

I also saw a second behavior: Naemon did not log a state change to OK for the service, but the INITIAL SERVICE STATE entry changed to OK, like this:
[Thu Jun 17 18:43:01 2021] SERVICE INFO: PABX;Port_8443; Service switch to hard down state due to host down.
[Thu Jun 17 18:43:01 2021] SERVICE ALERT: PABX;Port_8443;CRITICAL;HARD;1;CRITICAL - Socket timeout after 10 seconds
[Thu Jun 17 18:50:21 2021] HOST ALERT: PABX;UP;HARD;1;OK - x.x.x.x: , rta 0.446ms, lost 0%
[Thu Jun 17 18:59:35 2021] INITIAL HOST STATE: PABX;UP;HARD;1;OK - x.x.x.x: , rta 0.234ms, lost 0%
[Thu Jun 17 18:59:35 2021] INITIAL SERVICE STATE: PABX;Port_8443;OK;HARD;1;TCP OK - 0.000 second response time on x.x.x.x on port 8443

Note that the plugin output while the service was CRITICAL was "CRITICAL - Socket timeout after 10 seconds". After Naemon was restarted, the logged plugin output was that of an OK result, but no SERVICE ALERT for the OK HARD state was ever generated.
The host came back UP about 9 minutes before Naemon was restarted, and no SERVICE ALERT with an OK state was generated for the service during that time.

Regards.

@ccztux
Contributor

ccztux commented Aug 6, 2024

> Hello,
>
> I found the problem described below. The host went down and Naemon set the service to CRITICAL HARD, but when the host came back UP, Naemon set the service to OK SOFT. This broke some availability reports that depend on HARD states for their calculations. The question is: should the service not be set to OK HARD when the host comes back up?
>
> Thanks.
>
> [Fri Jul 23 03:39:31 2021] INITIAL SERVICE STATE: HOSTDEMO;SVCDEMO;OK;HARD;1;OK
> [Fri Jul 23 21:41:11 2021] HOST ALERT: HOSTDEMO;DOWN;SOFT;1;CRITICAL - 192.168.54.32: rta nan, lost 100%
> [Fri Jul 23 21:41:21 2021] HOST ALERT: HOSTDEMO;DOWN;SOFT;2;CRITICAL - 192.168.54.32: rta nan, lost 100%
> [Fri Jul 23 21:41:37 2021] HOST ALERT: HOSTDEMO;DOWN;HARD;3;CRITICAL - 192.168.54.32: rta nan, lost 100%
> [Fri Jul 23 21:42:57 2021] SERVICE INFO: HOSTDEMO;SVCDEMO; Service switch to hard down state due to host down.
> [Fri Jul 23 21:42:57 2021] SERVICE ALERT: HOSTDEMO;SVCDEMO;CRITICAL;HARD;1;CRITICAL - cannot connect
> [Fri Jul 23 21:46:57 2021] HOST ALERT: HOSTDEMO;UP;HARD;1;OK - 192.168.54.32: , rta 0.259ms, lost 0%
> [Fri Jul 23 21:47:17 2021] SERVICE ALERT: HOSTDEMO;SVCDEMO;CRITICAL;SOFT;1;CRITICAL - cannot connect
> [Fri Jul 23 21:49:17 2021] SERVICE ALERT: HOSTDEMO;SVCDEMO;CRITICAL;SOFT;2;CRITICAL - cannot connect
> [Fri Jul 23 21:51:18 2021] SERVICE ALERT: HOSTDEMO;SVCDEMO;OK;SOFT;3;OK

Unfortunately, I can confirm this behaviour in Naemon 1.4.1:

[Tue Jun 25 03:40:01 2024] CURRENT SERVICE STATE: localhost;NCPA Connection;OK;HARD;1;OK: NCPA Agent (Version: 2.1.6, OS: Windows) is accessible via API (HTTPS, Port: 5693)
[Tue Jun 25 10:09:50 2024] SERVICE DOWNTIME ALERT: localhost;NCPA Connection;STARTED; Service has entered a period of scheduled downtime
[Tue Jun 25 10:09:50 2024] SERVICE NOTIFICATION SUPPRESSED: localhost;NCPA Connection;Notifications about SCHEDULED DOWNTIME events blocked for this object.
[Tue Jun 25 10:15:13 2024] SERVICE INFO: localhost;NCPA Connection; Service switch to hard down state due to host down.
[Tue Jun 25 10:15:13 2024] SERVICE ALERT: localhost;NCPA Connection;CRITICAL;HARD;1;CRITICAL - Connection to API (HTTPS, Port: 5693) failed.  Connection error: Connection timed out after 58000 milliseconds
[Tue Jun 25 10:24:20 2024] SERVICE ALERT: localhost;NCPA Connection;CRITICAL;SOFT;1;CRITICAL - Connection to API (HTTPS, Port: 5693) failed.  Connection error: Failed to connect to 127.0.0.1 port 5693: Connection refused
[Tue Jun 25 10:27:24 2024] SERVICE ALERT: localhost;NCPA Connection;OK;SOFT;2;OK: NCPA Agent (Version: 2.1.6, OS: Windows) is accessible via API (HTTPS, Port: 5693)
[Tue Jun 25 10:42:46 2024] SERVICE ALERT: localhost;NCPA Connection;CRITICAL;SOFT;1;CRITICAL - Connection to API (HTTPS, Port: 5693) failed.  Connection error: Failed to connect to 127.0.0.1 port 5693: Connection refused
[Tue Jun 25 10:45:52 2024] SERVICE ALERT: localhost;NCPA Connection;OK;SOFT;2;OK: NCPA Agent (Version: 2.1.6, OS: Windows) is accessible via API (HTTPS, Port: 5693)
[Tue Jun 25 16:00:00 2024] SERVICE DOWNTIME ALERT: localhost;NCPA Connection;STOPPED; Service has exited from a period of scheduled downtime

The OK HARD state that should follow the OK SOFT state, as described in the documentation, is missing:

image

The OK SOFT state only becomes OK HARD through the CURRENT SERVICE STATE entry that is written when the log file is rotated the next day.

This is how it looks in a Thruk availability report:

image
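
The size of the effect can be estimated from the log excerpt above. The following is a rough sketch and not Thruk's actual algorithm: it sums the time spent in CRITICAL once using every logged state and once using HARD states only. The helper name `critical_seconds`, the hand-copied event list, and the choice to end the window at the last entry of the excerpt (16:00:00) are all assumptions made for illustration.

```python
from datetime import datetime

FMT = "%a %b %d %H:%M:%S %Y"  # timestamp format used in the log lines

# (timestamp, state, state_type) copied from the SERVICE ALERT lines above
events = [
    ("Tue Jun 25 10:15:13 2024", "CRITICAL", "HARD"),
    ("Tue Jun 25 10:24:20 2024", "CRITICAL", "SOFT"),
    ("Tue Jun 25 10:27:24 2024", "OK", "SOFT"),
    ("Tue Jun 25 10:42:46 2024", "CRITICAL", "SOFT"),
    ("Tue Jun 25 10:45:52 2024", "OK", "SOFT"),
]
window_end = "Tue Jun 25 16:00:00 2024"  # assumed end: last entry in the excerpt


def critical_seconds(events, window_end, hard_only):
    """Sum the seconds spent in CRITICAL up to window_end.

    With hard_only=True, SOFT entries are skipped, which mimics a report
    that only trusts HARD states. Hypothetical helper for illustration.
    """
    end = datetime.strptime(window_end, FMT)
    total = 0.0
    critical_since = None
    for ts, state, state_type in events:
        if hard_only and state_type != "HARD":
            continue
        when = datetime.strptime(ts, FMT)
        if state == "CRITICAL" and critical_since is None:
            critical_since = when  # a CRITICAL period starts
        elif state != "CRITICAL" and critical_since is not None:
            total += (when - critical_since).total_seconds()
            critical_since = None  # the CRITICAL period ends
    if critical_since is not None:  # still CRITICAL at the end of the window
        total += (end - critical_since).total_seconds()
    return total


print(critical_seconds(events, window_end, hard_only=False))  # 917.0
print(critical_seconds(events, window_end, hard_only=True))   # 20687.0
```

Under these assumptions the service was CRITICAL for about 15 minutes when every state is counted, but a HARD-only view keeps it CRITICAL for the remaining 5 hours 45 minutes of the window, because no OK HARD entry ever closes the outage. The sketch ignores the scheduled downtime, which a real report may treat separately.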
