Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Checkable: send state notifications after suppression if and only if the state differs compared to before the suppression started #9207

Merged
merged 3 commits into from
Mar 7, 2022

Conversation

julianbrost
Copy link
Contributor

@julianbrost julianbrost commented Feb 4, 2022

This PR changes the behavior of state notifications (problem, recovery) following a suppression condition (unreachable, in downtime, acknowledged) so that notification is sent if and only if the hard state after the suppression is different from the (last) hard state from before the suppression. If the suppression ends while the checkable is in a soft state, the notification is further delayed until after a hard state was reached.

This is achieved by the following changes:

  1. While the suppression is ongoing, the NotificationProblem and NotificationRecovery flags are never removed from the suppressed_notifications bitset, they may only be added. If they became obsolete is only checked after the suppression reason is gone when actually trying to send them. So actually both flags may be set at the same time. They can be interpreted as "at some point during the suppression, this notification would have been sent if there was no suppression".
  2. When either of these two flags is set for the first time, the previous state is saved in state_before_suppression. When the suppression ends, the current state is compared with this value to decide if a notification should actually be sent.

Note for reviewers: better look at the two commits separately. The one adding the tests has to expose the FireSuppressedNotifications function that was previously only accessible from the same compilation unit. This adds quite some bloat in the overall diff that does not introduce any change in behavior.

Limitations

  • Even when the fix is deployed, checkables which entered their suppression state on an older version may still receive one extra problem notification or miss one recovery notification. This happens because current version don't store enough information to reliably reconstruct the state before the suppression started, so there isn't much we can do about this. (Example: Both OK -> CRITICAL and WARNING -> CRITICAL currently result in suppressed_notifications = NotificationProblem).

Tests

Automatic tests

The PR adds test case is added which iterates over all sequences of service states of length 4, applies them to a service while in downtime and checks if the appropriate notifications as described before are generated. This should test most of the new logic.

Manual tests

In order to test the new cluster sync message and also have some general test that everything also looks fine outside of the test suite, I performed the following test:

  1. Create a total of 6 services, 3 of which are used for testing the feature only on a single node, and the other 3 are used to see if the same notifications are also generated, when a failover to the other node in an HA zone happens.
  2. Start only the first node in the HA zone.
  3. Bring all 6 services into a CRITICAL state.
  4. Start downtimes for all 6 services.
  5. Bring all 6 services into a WARNING state (this is the point where the new code sets the state_before_suppression attribute).
  6. Bring the first set of 3 services into OK, WARNING, PROBLEM states.
  7. End the downtimes for the first set of 3 services.
  8. Observe the notifications generated (service in OK should receive a recovery notification, in WARNING a problem notification and in CRITICAL no notification).
  9. Start the second node in the HA zone and give it some time to synchronize everything.
  10. Stop the first node.
  11. Repeat steps 6, 7, 8 for the second set of 3 services.

Snippet from my test configuration (you have to add standard apply rules for notifications):

object Host "github-9207" {
	check_command = "dummy"
}

apply Service for (_ in ["none-1", "rec-1", "problm-1", "none-2", "rec-2", "problm-2"]) {
	assign where host.name == "github-9207"
	check_command = "dummy"
	max_check_attempts = 1
	enable_active_checks = false
}

Note that the object names were chosen carefully to that the authority for the notification objects for *-1 is on master-1 and for *-2 on master-2 when both nodes are running to avoid triggering bug #9265. Otherwise some of the notifications observed in step 8 may be duplicated on the second node in step 9.

I've written a test script for my local test cluster in docker compose. This probably won't work anywhere else out of the box but should give enough information to reproduce the tests in detail.

My test script
#!/bin/sh

set -eu

api_req() {
	node=$1
	shift
	docker compose exec -T "$node" curl -skSu root:icinga -H 'Accept: application/json' "$@"
}

api_version() {
	api_req $1 https://localhost:5665/v1 | jq -r '.results[0].version'
}

api_check_result() {
	case "$3" in
		OK)       state=0 ;;
		WARNING)  state=1 ;;
		CRITICAL) state=2 ;;
	esac
	api_req "$1" https://localhost:5665/v1/actions/process-check-result -d@- <<EOF
{
	"type": "Service",
	"service": "github-9207!$2",
	"exit_status": $state,
	"plugin_output": "test"
}
EOF
	echo
}

api_schedule_downtime() {
	api_req "$1" https://localhost:5665/v1/actions/schedule-downtime -d@- <<EOF
{
	"type": "Service",
	"service": "github-9207!$2",
	"author": "test",
	"comment": "test",
	"start_time": "$(date -d '-1 hour' +%s)",
	"end_time": "$(date -d '+1 hour' +%s)"
}
EOF
	echo
}

api_remove_downtime() {
	api_req "$1" https://localhost:5665/v1/actions/remove-downtime -d@- <<EOF
{
	"type": "Service",
	"service": "github-9207!$2"
}
EOF
	echo
}

log() {
	printf '[%s] %s\n' "$(date -Is)" "$*"
}

log "Ensure clean initial state"
docker compose up -d master-1 master-2
sleep 60
for service in {none,rec,problm}-{1,2}; do
	api_remove_downtime master-1 "$service"
done
sleep 5
for service in {none,rec,problm}-{1,2}; do
	api_check_result master-1 "$service" CRITICAL
done
sleep 60
docker compose down # clear logs

log "Starting master-1"
docker compose up -d master-1
sleep 60

log "master-1 version"
api_version master-1

log "Ensure all services are CRITICAL"
for service in {none,rec,problm}-{1,2}; do
	api_check_result master-1 "$service" CRITICAL
done
sleep 5

log "Schedule downtimes for all services"
for service in {none,rec,problm}-{1,2}; do
	api_schedule_downtime master-1 "$service"
done
sleep 5

log "Make all services WARNING"
for service in {none,rec,problm}-{1,2}; do
	api_check_result master-1 "$service" WARNING
done
sleep 5

log "Bring services *-1 into final states"
api_check_result master-1 none-1   CRITICAL
api_check_result master-1 rec-1    OK
api_check_result master-1 problm-1 WARNING
sleep 5

log "Cancel downtimes for services *-1"
for service in {none,rec,problm}-1; do
	api_remove_downtime master-1 "$service"
done
sleep 60

log "Notification logs from master-1"
docker compose logs master-1 | grep -F /Notification: | grep -F github-9207 || true
log "Notification logs from master-1 (filtered)"
docker compose logs master-1 | grep -F /Notification: | grep -F github-9207 | grep -F Sending | grep -F -e Problem -e Recovery || true

log "Starting master-2"
docker compose up -d master-2
sleep 60

log "master-2 version"
api_version master-2

log "Stopping master-1"
docker compose stop master-1
sleep 60

log "Bring services *-2 into final states"
api_check_result master-2 none-2   CRITICAL
api_check_result master-2 rec-2    OK
api_check_result master-2 problm-2 WARNING
sleep 5

log "Cancel downtimes for services *-2"
for service in {none,rec,problm}-2; do
	api_remove_downtime master-2 "$service"
done
sleep 60

log "Notification logs from master-2"
docker compose logs master-2 | grep -F /Notification: | grep -F github-9207 || true
log "Notification logs from master-2 (filtered)"
docker compose logs master-2 | grep -F /Notification: | grep -F github-9207 | grep -F Sending | grep -F -e Problem -e Recovery || true

docker compose down

Before (db321b9)

Both nodes send a notification for the CRITICAL -> WARNING -> CRITICAL service (none-1/2) and miss the notification for the CRTICAL -> WARNING -> OK (rec-1/2) service.

master-1:

[2022-02-28 15:58:10 +0100] information/Notification: Sending 'Problem' notification 'github-9207!none-1!dummy-service-notification' for user 'dummy'
[2022-02-28 15:58:10 +0100] information/Notification: Sending 'Problem' notification 'github-9207!problm-1!dummy-service-notification' for user 'dummy'

master-2:

[2022-02-28 16:01:20 +0100] information/Notification: Sending 'Problem' notification 'github-9207!problm-2!dummy-service-notification' for user 'dummy'
[2022-02-28 16:01:20 +0100] information/Notification: Sending 'Problem' notification 'github-9207!none-2!dummy-service-notification' for user 'dummy'

Full output
[2022-02-28T15:54:27+01:00] Ensure clean initial state
Container cluster-master-2-1  Creating
Container cluster-master-1-1  Creating
Container cluster-master-1-1  Created
Container cluster-master-2-1  Created
Container cluster-master-2-1  Starting
Container cluster-master-1-1  Starting
Container cluster-master-2-1  Started
Container cluster-master-1-1  Started
{"results":[{"code":200,"status":"Successfully removed all downtimes for object 'github-9207!none-1' and 0 child downtimes."}]}
{"results":[{"code":200,"status":"Successfully removed all downtimes for object 'github-9207!none-2' and 0 child downtimes."}]}
{"results":[{"code":200,"status":"Successfully removed all downtimes for object 'github-9207!rec-1' and 0 child downtimes."}]}
{"results":[{"code":200,"status":"Successfully removed all downtimes for object 'github-9207!rec-2' and 0 child downtimes."}]}
{"results":[{"code":200,"status":"Successfully removed all downtimes for object 'github-9207!problm-1' and 0 child downtimes."}]}
{"results":[{"code":200,"status":"Successfully removed all downtimes for object 'github-9207!problm-2' and 0 child downtimes."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!none-1'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!none-2'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!rec-1'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!rec-2'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!problm-1'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!problm-2'."}]}
Container cluster-master-2-1  Stopping
Container cluster-master-2-1  Stopping
Container cluster-master-1-1  Stopping
Container cluster-master-1-1  Stopping
Container cluster-master-1-1  Stopped
Container cluster-master-1-1  Removing
Container cluster-master-1-1  Removed
Container cluster-master-2-1  Stopped
Container cluster-master-2-1  Removing
Container cluster-master-2-1  Removed
Network cluster_default  Removing
Network cluster_default  Removed
[2022-02-28T15:56:42+01:00] Starting master-1
Network cluster_default  Creating
Network cluster_default  Created
Container cluster-master-1-1  Creating
Container cluster-master-1-1  Created
Container cluster-master-1-1  Starting
Container cluster-master-1-1  Started
[2022-02-28T15:57:42+01:00] master-1 version
v2.13.0-207-gdb321b9fc
[2022-02-28T15:57:43+01:00] Ensure all services are CRITICAL
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!none-1'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!none-2'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!rec-1'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!rec-2'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!problm-1'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!problm-2'."}]}
[2022-02-28T15:57:49+01:00] Schedule downtimes for all services
{"results":[{"code":200,"legacy_id":13,"name":"github-9207!none-1!ac7e4ff4-26fc-4d9d-b8c0-2a0d19a4b4ff","status":"Successfully scheduled downtime 'github-9207!none-1!ac7e4ff4-26fc-4d9d-b8c0-2a0d19a4b4ff' for object 'github-9207!none-1'."}]}
{"results":[{"code":200,"legacy_id":14,"name":"github-9207!none-2!0e612687-a83e-4fbc-99d3-92a0c825b655","status":"Successfully scheduled downtime 'github-9207!none-2!0e612687-a83e-4fbc-99d3-92a0c825b655' for object 'github-9207!none-2'."}]}
{"results":[{"code":200,"legacy_id":15,"name":"github-9207!rec-1!b155a127-bbd7-4a8f-bcce-275140a7d22b","status":"Successfully scheduled downtime 'github-9207!rec-1!b155a127-bbd7-4a8f-bcce-275140a7d22b' for object 'github-9207!rec-1'."}]}
{"results":[{"code":200,"legacy_id":16,"name":"github-9207!rec-2!39be7890-7eea-445c-ab8e-cba021c7f77d","status":"Successfully scheduled downtime 'github-9207!rec-2!39be7890-7eea-445c-ab8e-cba021c7f77d' for object 'github-9207!rec-2'."}]}
{"results":[{"code":200,"legacy_id":17,"name":"github-9207!problm-1!ed904c13-51f0-4ae9-ab2f-b1a6dc352b27","status":"Successfully scheduled downtime 'github-9207!problm-1!ed904c13-51f0-4ae9-ab2f-b1a6dc352b27' for object 'github-9207!problm-1'."}]}
{"results":[{"code":200,"legacy_id":18,"name":"github-9207!problm-2!93155660-7ecf-416a-811f-20439a55eb9f","status":"Successfully scheduled downtime 'github-9207!problm-2!93155660-7ecf-416a-811f-20439a55eb9f' for object 'github-9207!problm-2'."}]}
[2022-02-28T15:57:55+01:00] Make all services WARNING
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!none-1'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!none-2'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!rec-1'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!rec-2'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!problm-1'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!problm-2'."}]}
[2022-02-28T15:58:00+01:00] Bring services *-1 into final states
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!none-1'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!rec-1'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!problm-1'."}]}
[2022-02-28T15:58:06+01:00] Cancel downtimes for services *-1
{"results":[{"code":200,"status":"Successfully removed all downtimes for object 'github-9207!none-1' and 0 child downtimes."}]}
{"results":[{"code":200,"status":"Successfully removed all downtimes for object 'github-9207!rec-1' and 0 child downtimes."}]}
{"results":[{"code":200,"status":"Successfully removed all downtimes for object 'github-9207!problm-1' and 0 child downtimes."}]}
[2022-02-28T15:59:06+01:00] Notification logs from master-1
cluster-master-1-1  | [2022-02-28 15:57:49 +0100] information/Notification: Sending 'DowntimeStart' notification 'github-9207!none-1!dummy-service-notification' for user 'dummy'
cluster-master-1-1  | [2022-02-28 15:57:49 +0100] information/Notification: Completed sending 'DowntimeStart' notification 'github-9207!none-1!dummy-service-notification' for checkable 'github-9207!none-1' and user 'dummy' using command 'dummy'.
cluster-master-1-1  | [2022-02-28 15:57:49 +0100] information/Notification: Sending 'DowntimeStart' notification 'github-9207!none-2!dummy-service-notification' for user 'dummy'
cluster-master-1-1  | [2022-02-28 15:57:49 +0100] information/Notification: Completed sending 'DowntimeStart' notification 'github-9207!none-2!dummy-service-notification' for checkable 'github-9207!none-2' and user 'dummy' using command 'dummy'.
cluster-master-1-1  | [2022-02-28 15:57:49 +0100] information/Notification: Sending 'DowntimeStart' notification 'github-9207!rec-1!dummy-service-notification' for user 'dummy'
cluster-master-1-1  | [2022-02-28 15:57:49 +0100] information/Notification: Completed sending 'DowntimeStart' notification 'github-9207!rec-1!dummy-service-notification' for checkable 'github-9207!rec-1' and user 'dummy' using command 'dummy'.
cluster-master-1-1  | [2022-02-28 15:57:49 +0100] information/Notification: Sending 'DowntimeStart' notification 'github-9207!rec-2!dummy-service-notification' for user 'dummy'
cluster-master-1-1  | [2022-02-28 15:57:49 +0100] information/Notification: Completed sending 'DowntimeStart' notification 'github-9207!rec-2!dummy-service-notification' for checkable 'github-9207!rec-2' and user 'dummy' using command 'dummy'.
cluster-master-1-1  | [2022-02-28 15:57:49 +0100] information/Notification: Sending 'DowntimeStart' notification 'github-9207!problm-1!dummy-service-notification' for user 'dummy'
cluster-master-1-1  | [2022-02-28 15:57:49 +0100] information/Notification: Completed sending 'DowntimeStart' notification 'github-9207!problm-1!dummy-service-notification' for checkable 'github-9207!problm-1' and user 'dummy' using command 'dummy'.
cluster-master-1-1  | [2022-02-28 15:57:50 +0100] information/Notification: Sending 'DowntimeStart' notification 'github-9207!problm-2!dummy-service-notification' for user 'dummy'
cluster-master-1-1  | [2022-02-28 15:57:50 +0100] information/Notification: Completed sending 'DowntimeStart' notification 'github-9207!problm-2!dummy-service-notification' for checkable 'github-9207!problm-2' and user 'dummy' using command 'dummy'.
cluster-master-1-1  | [2022-02-28 15:58:06 +0100] information/Notification: Sending 'DowntimeEnd' notification 'github-9207!none-1!dummy-service-notification' for user 'dummy'
cluster-master-1-1  | [2022-02-28 15:58:06 +0100] information/Notification: Completed sending 'DowntimeEnd' notification 'github-9207!none-1!dummy-service-notification' for checkable 'github-9207!none-1' and user 'dummy' using command 'dummy'.
cluster-master-1-1  | [2022-02-28 15:58:06 +0100] information/Notification: Sending 'DowntimeEnd' notification 'github-9207!rec-1!dummy-service-notification' for user 'dummy'
cluster-master-1-1  | [2022-02-28 15:58:06 +0100] information/Notification: Completed sending 'DowntimeEnd' notification 'github-9207!rec-1!dummy-service-notification' for checkable 'github-9207!rec-1' and user 'dummy' using command 'dummy'.
cluster-master-1-1  | [2022-02-28 15:58:06 +0100] information/Notification: Sending 'DowntimeEnd' notification 'github-9207!problm-1!dummy-service-notification' for user 'dummy'
cluster-master-1-1  | [2022-02-28 15:58:06 +0100] information/Notification: Completed sending 'DowntimeEnd' notification 'github-9207!problm-1!dummy-service-notification' for checkable 'github-9207!problm-1' and user 'dummy' using command 'dummy'.
cluster-master-1-1  | [2022-02-28 15:58:10 +0100] information/Notification: Sending 'Problem' notification 'github-9207!none-1!dummy-service-notification' for user 'dummy'
cluster-master-1-1  | [2022-02-28 15:58:10 +0100] information/Notification: Sending 'Problem' notification 'github-9207!problm-1!dummy-service-notification' for user 'dummy'
cluster-master-1-1  | [2022-02-28 15:58:10 +0100] information/Notification: Completed sending 'Problem' notification 'github-9207!none-1!dummy-service-notification' for checkable 'github-9207!none-1' and user 'dummy' using command 'dummy'.
cluster-master-1-1  | [2022-02-28 15:58:10 +0100] information/Notification: Completed sending 'Problem' notification 'github-9207!problm-1!dummy-service-notification' for checkable 'github-9207!problm-1' and user 'dummy' using command 'dummy'.
[2022-02-28T15:59:07+01:00] Notification logs from master-1 (filtered)
cluster-master-1-1  | [2022-02-28 15:58:10 +0100] information/Notification: Sending 'Problem' notification 'github-9207!none-1!dummy-service-notification' for user 'dummy'
cluster-master-1-1  | [2022-02-28 15:58:10 +0100] information/Notification: Sending 'Problem' notification 'github-9207!problm-1!dummy-service-notification' for user 'dummy'
[2022-02-28T15:59:07+01:00] Starting master-2
Container cluster-master-2-1  Creating
Container cluster-master-2-1  Created
Container cluster-master-2-1  Starting
Container cluster-master-2-1  Started
[2022-02-28T16:00:07+01:00] master-2 version
v2.13.0-207-gdb321b9fc
[2022-02-28T16:00:08+01:00] Stopping master-1
Container cluster-master-1-1  Stopping
Container cluster-master-1-1  Stopped
[2022-02-28T16:01:13+01:00] Bring services *-2 into final states
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!none-2'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!rec-2'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!problm-2'."}]}
[2022-02-28T16:01:18+01:00] Cancel downtimes for services *-2
{"results":[{"code":200,"status":"Successfully removed all downtimes for object 'github-9207!none-2' and 0 child downtimes."}]}
{"results":[{"code":200,"status":"Successfully removed all downtimes for object 'github-9207!rec-2' and 0 child downtimes."}]}
{"results":[{"code":200,"status":"Successfully removed all downtimes for object 'github-9207!problm-2' and 0 child downtimes."}]}
[2022-02-28T16:02:19+01:00] Notification logs from master-2
cluster-master-2-1  | [2022-02-28 15:59:13 +0100] information/Notification: Sending 'DowntimeStart' notification 'github-9207!none-2!dummy-service-notification' for user 'dummy'
cluster-master-2-1  | [2022-02-28 15:59:13 +0100] information/Notification: Completed sending 'DowntimeStart' notification 'github-9207!none-2!dummy-service-notification' for checkable 'github-9207!none-2' and user 'dummy' using command 'dummy'.
cluster-master-2-1  | [2022-02-28 15:59:13 +0100] information/Notification: Sending 'DowntimeStart' notification 'github-9207!rec-2!dummy-service-notification' for user 'dummy'
cluster-master-2-1  | [2022-02-28 15:59:13 +0100] information/Notification: Completed sending 'DowntimeStart' notification 'github-9207!rec-2!dummy-service-notification' for checkable 'github-9207!rec-2' and user 'dummy' using command 'dummy'.
cluster-master-2-1  | [2022-02-28 15:59:13 +0100] information/Notification: Sending 'DowntimeStart' notification 'github-9207!problm-2!dummy-service-notification' for user 'dummy'
cluster-master-2-1  | [2022-02-28 15:59:13 +0100] information/Notification: Completed sending 'DowntimeStart' notification 'github-9207!problm-2!dummy-service-notification' for checkable 'github-9207!problm-2' and user 'dummy' using command 'dummy'.
cluster-master-2-1  | [2022-02-28 16:01:18 +0100] information/Notification: Sending 'DowntimeEnd' notification 'github-9207!none-2!dummy-service-notification' for user 'dummy'
cluster-master-2-1  | [2022-02-28 16:01:18 +0100] information/Notification: Completed sending 'DowntimeEnd' notification 'github-9207!none-2!dummy-service-notification' for checkable 'github-9207!none-2' and user 'dummy' using command 'dummy'.
cluster-master-2-1  | [2022-02-28 16:01:19 +0100] information/Notification: Sending 'DowntimeEnd' notification 'github-9207!rec-2!dummy-service-notification' for user 'dummy'
cluster-master-2-1  | [2022-02-28 16:01:19 +0100] information/Notification: Completed sending 'DowntimeEnd' notification 'github-9207!rec-2!dummy-service-notification' for checkable 'github-9207!rec-2' and user 'dummy' using command 'dummy'.
cluster-master-2-1  | [2022-02-28 16:01:19 +0100] information/Notification: Sending 'DowntimeEnd' notification 'github-9207!problm-2!dummy-service-notification' for user 'dummy'
cluster-master-2-1  | [2022-02-28 16:01:19 +0100] information/Notification: Completed sending 'DowntimeEnd' notification 'github-9207!problm-2!dummy-service-notification' for checkable 'github-9207!problm-2' and user 'dummy' using command 'dummy'.
cluster-master-2-1  | [2022-02-28 16:01:20 +0100] information/Notification: Sending 'Problem' notification 'github-9207!problm-2!dummy-service-notification' for user 'dummy'
cluster-master-2-1  | [2022-02-28 16:01:20 +0100] information/Notification: Sending 'Problem' notification 'github-9207!none-2!dummy-service-notification' for user 'dummy'
cluster-master-2-1  | [2022-02-28 16:01:20 +0100] information/Notification: Completed sending 'Problem' notification 'github-9207!problm-2!dummy-service-notification' for checkable 'github-9207!problm-2' and user 'dummy' using command 'dummy'.
cluster-master-2-1  | [2022-02-28 16:01:20 +0100] information/Notification: Completed sending 'Problem' notification 'github-9207!none-2!dummy-service-notification' for checkable 'github-9207!none-2' and user 'dummy' using command 'dummy'.
[2022-02-28T16:02:19+01:00] Notification logs from master-2 (filtered)
cluster-master-2-1  | [2022-02-28 16:01:20 +0100] information/Notification: Sending 'Problem' notification 'github-9207!problm-2!dummy-service-notification' for user 'dummy'
cluster-master-2-1  | [2022-02-28 16:01:20 +0100] information/Notification: Sending 'Problem' notification 'github-9207!none-2!dummy-service-notification' for user 'dummy'
Container cluster-master-2-1  Stopping
Container cluster-master-2-1  Stopping
Container cluster-master-1-1  Stopping
Container cluster-master-1-1  Stopping
Container cluster-master-1-1  Stopped
Container cluster-master-1-1  Removing
Container cluster-master-1-1  Removed
Container cluster-master-2-1  Stopped
Container cluster-master-2-1  Removing
Container cluster-master-2-1  Removed
Network cluster_default  Removing
Network cluster_default  Removed

After (90848f6)

Both nodes

  • send a recover notification for the CRITICAL -> WARNING -> OK services (rec-1/2)
  • send a problem notification for the CRITICAL -> WARNING -> WARNING services (problm-1/2)
  • don't send any notification for the CRITICAL -> WARNING -> CRITICAL services (none-1/2)

master-1:

[2022-03-03 17:22:33 +0100] information/Notification: Sending 'Recovery' notification 'github-9207!rec-1!dummy-service-notification' for user 'dummy'
[2022-03-03 17:22:33 +0100] information/Notification: Sending 'Problem' notification 'github-9207!problm-1!dummy-service-notification' for user 'dummy'

master-2:

[2022-03-03 17:25:47 +0100] information/Notification: Sending 'Problem' notification 'github-9207!problm-2!dummy-service-notification' for user 'dummy'
[2022-03-03 17:25:47 +0100] information/Notification: Sending 'Recovery' notification 'github-9207!rec-2!dummy-service-notification' for user 'dummy'
Full output
[2022-03-03T17:18:47+01:00] Ensure clean initial state
Container cluster-master-2-1  Creating
Container cluster-master-1-1  Creating
Container cluster-master-2-1  Created
Container cluster-master-1-1  Created
Container cluster-master-1-1  Starting
Container cluster-master-2-1  Starting
Container cluster-master-2-1  Started
Container cluster-master-1-1  Started
{"results":[{"code":200,"status":"Successfully removed all downtimes for object 'github-9207!none-1' and 0 child downtimes."}]}
{"results":[{"code":200,"status":"Successfully removed all downtimes for object 'github-9207!none-2' and 0 child downtimes."}]}
{"results":[{"code":200,"status":"Successfully removed all downtimes for object 'github-9207!rec-1' and 0 child downtimes."}]}
{"results":[{"code":200,"status":"Successfully removed all downtimes for object 'github-9207!rec-2' and 0 child downtimes."}]}
{"results":[{"code":200,"status":"Successfully removed all downtimes for object 'github-9207!problm-1' and 0 child downtimes."}]}
{"results":[{"code":200,"status":"Successfully removed all downtimes for object 'github-9207!problm-2' and 0 child downtimes."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!none-1'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!none-2'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!rec-1'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!rec-2'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!problm-1'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!problm-2'."}]}
Container cluster-master-2-1  Stopping
Container cluster-master-2-1  Stopping
Container cluster-master-1-1  Stopping
Container cluster-master-1-1  Stopping
Container cluster-master-2-1  Stopped
Container cluster-master-2-1  Removing
Container cluster-master-2-1  Removed
Container cluster-master-1-1  Stopped
Container cluster-master-1-1  Removing
Container cluster-master-1-1  Removed
Network cluster_default  Removing
Network cluster_default  Removed
[2022-03-03T17:21:04+01:00] Starting master-1
Network cluster_default  Creating
Network cluster_default  Created
Container cluster-master-1-1  Creating
Container cluster-master-1-1  Created
Container cluster-master-1-1  Starting
Container cluster-master-1-1  Started
[2022-03-03T17:22:05+01:00] master-1 version
v2.13.0-232-g90848f602
[2022-03-03T17:22:05+01:00] Ensure all services are CRITICAL
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!none-1'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!none-2'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!rec-1'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!rec-2'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!problm-1'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!problm-2'."}]}
[2022-03-03T17:22:11+01:00] Schedule downtimes for all services
{"results":[{"code":200,"legacy_id":12,"name":"github-9207!none-1!94c357cb-c634-4cac-9778-e560b84ae502","status":"Successfully scheduled downtime 'github-9207!none-1!94c357cb-c634-4cac-9778-e560b84ae502' for object 'github-9207!none-1'."}]}
{"results":[{"code":200,"legacy_id":13,"name":"github-9207!none-2!067bf1fc-8b4b-42dd-a2ae-b3fcbac01282","status":"Successfully scheduled downtime 'github-9207!none-2!067bf1fc-8b4b-42dd-a2ae-b3fcbac01282' for object 'github-9207!none-2'."}]}
{"results":[{"code":200,"legacy_id":14,"name":"github-9207!rec-1!7d16becf-326b-40af-bdc9-40bcb6ec4957","status":"Successfully scheduled downtime 'github-9207!rec-1!7d16becf-326b-40af-bdc9-40bcb6ec4957' for object 'github-9207!rec-1'."}]}
{"results":[{"code":200,"legacy_id":15,"name":"github-9207!rec-2!b50d5f5c-b8a6-4f0f-a904-63f0a41d20d3","status":"Successfully scheduled downtime 'github-9207!rec-2!b50d5f5c-b8a6-4f0f-a904-63f0a41d20d3' for object 'github-9207!rec-2'."}]}
{"results":[{"code":200,"legacy_id":16,"name":"github-9207!problm-1!a67eb273-b124-44c5-a08c-4973eb1e882d","status":"Successfully scheduled downtime 'github-9207!problm-1!a67eb273-b124-44c5-a08c-4973eb1e882d' for object 'github-9207!problm-1'."}]}
{"results":[{"code":200,"legacy_id":17,"name":"github-9207!problm-2!c91aa1f2-028f-4766-9c77-dda19f71e7cd","status":"Successfully scheduled downtime 'github-9207!problm-2!c91aa1f2-028f-4766-9c77-dda19f71e7cd' for object 'github-9207!problm-2'."}]}
[2022-03-03T17:22:17+01:00] Make all services WARNING
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!none-1'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!none-2'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!rec-1'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!rec-2'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!problm-1'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!problm-2'."}]}
[2022-03-03T17:22:23+01:00] Bring services *-1 into final states
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!none-1'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!rec-1'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!problm-1'."}]}
[2022-03-03T17:22:29+01:00] Cancel downtimes for services *-1
{"results":[{"code":200,"status":"Successfully removed all downtimes for object 'github-9207!none-1' and 0 child downtimes."}]}
{"results":[{"code":200,"status":"Successfully removed all downtimes for object 'github-9207!rec-1' and 0 child downtimes."}]}
{"results":[{"code":200,"status":"Successfully removed all downtimes for object 'github-9207!problm-1' and 0 child downtimes."}]}
[2022-03-03T17:23:29+01:00] Notification logs from master-1
cluster-master-1-1  | [2022-03-03 17:22:12 +0100] information/Notification: Sending 'DowntimeStart' notification 'github-9207!none-1!dummy-service-notification' for user 'dummy'
cluster-master-1-1  | [2022-03-03 17:22:12 +0100] information/Notification: Completed sending 'DowntimeStart' notification 'github-9207!none-1!dummy-service-notification' for checkable 'github-9207!none-1' and user 'dummy' using command 'dummy'.
cluster-master-1-1  | [2022-03-03 17:22:12 +0100] information/Notification: Sending 'DowntimeStart' notification 'github-9207!none-2!dummy-service-notification' for user 'dummy'
cluster-master-1-1  | [2022-03-03 17:22:12 +0100] information/Notification: Completed sending 'DowntimeStart' notification 'github-9207!none-2!dummy-service-notification' for checkable 'github-9207!none-2' and user 'dummy' using command 'dummy'.
cluster-master-1-1  | [2022-03-03 17:22:12 +0100] information/Notification: Sending 'DowntimeStart' notification 'github-9207!rec-1!dummy-service-notification' for user 'dummy'
cluster-master-1-1  | [2022-03-03 17:22:12 +0100] information/Notification: Completed sending 'DowntimeStart' notification 'github-9207!rec-1!dummy-service-notification' for checkable 'github-9207!rec-1' and user 'dummy' using command 'dummy'.
cluster-master-1-1  | [2022-03-03 17:22:12 +0100] information/Notification: Sending 'DowntimeStart' notification 'github-9207!rec-2!dummy-service-notification' for user 'dummy'
cluster-master-1-1  | [2022-03-03 17:22:12 +0100] information/Notification: Completed sending 'DowntimeStart' notification 'github-9207!rec-2!dummy-service-notification' for checkable 'github-9207!rec-2' and user 'dummy' using command 'dummy'.
cluster-master-1-1  | [2022-03-03 17:22:12 +0100] information/Notification: Sending 'DowntimeStart' notification 'github-9207!problm-1!dummy-service-notification' for user 'dummy'
cluster-master-1-1  | [2022-03-03 17:22:12 +0100] information/Notification: Completed sending 'DowntimeStart' notification 'github-9207!problm-1!dummy-service-notification' for checkable 'github-9207!problm-1' and user 'dummy' using command 'dummy'.
cluster-master-1-1  | [2022-03-03 17:22:12 +0100] information/Notification: Sending 'DowntimeStart' notification 'github-9207!problm-2!dummy-service-notification' for user 'dummy'
cluster-master-1-1  | [2022-03-03 17:22:12 +0100] information/Notification: Completed sending 'DowntimeStart' notification 'github-9207!problm-2!dummy-service-notification' for checkable 'github-9207!problm-2' and user 'dummy' using command 'dummy'.
cluster-master-1-1  | [2022-03-03 17:22:29 +0100] information/Notification: Sending 'DowntimeEnd' notification 'github-9207!none-1!dummy-service-notification' for user 'dummy'
cluster-master-1-1  | [2022-03-03 17:22:29 +0100] information/Notification: Completed sending 'DowntimeEnd' notification 'github-9207!none-1!dummy-service-notification' for checkable 'github-9207!none-1' and user 'dummy' using command 'dummy'.
cluster-master-1-1  | [2022-03-03 17:22:29 +0100] information/Notification: Sending 'DowntimeEnd' notification 'github-9207!rec-1!dummy-service-notification' for user 'dummy'
cluster-master-1-1  | [2022-03-03 17:22:29 +0100] information/Notification: Completed sending 'DowntimeEnd' notification 'github-9207!rec-1!dummy-service-notification' for checkable 'github-9207!rec-1' and user 'dummy' using command 'dummy'.
cluster-master-1-1  | [2022-03-03 17:22:29 +0100] information/Notification: Sending 'DowntimeEnd' notification 'github-9207!problm-1!dummy-service-notification' for user 'dummy'
cluster-master-1-1  | [2022-03-03 17:22:29 +0100] information/Notification: Completed sending 'DowntimeEnd' notification 'github-9207!problm-1!dummy-service-notification' for checkable 'github-9207!problm-1' and user 'dummy' using command 'dummy'.
cluster-master-1-1  | [2022-03-03 17:22:33 +0100] information/Notification: Sending 'Recovery' notification 'github-9207!rec-1!dummy-service-notification' for user 'dummy'
cluster-master-1-1  | [2022-03-03 17:22:33 +0100] information/Notification: Sending 'Problem' notification 'github-9207!problm-1!dummy-service-notification' for user 'dummy'
cluster-master-1-1  | [2022-03-03 17:22:33 +0100] information/Notification: Completed sending 'Recovery' notification 'github-9207!rec-1!dummy-service-notification' for checkable 'github-9207!rec-1' and user 'dummy' using command 'dummy'.
cluster-master-1-1  | [2022-03-03 17:22:33 +0100] information/Notification: Completed sending 'Problem' notification 'github-9207!problm-1!dummy-service-notification' for checkable 'github-9207!problm-1' and user 'dummy' using command 'dummy'.
[2022-03-03T17:23:30+01:00] Notification logs from master-1 (filtered)
cluster-master-1-1  | [2022-03-03 17:22:33 +0100] information/Notification: Sending 'Recovery' notification 'github-9207!rec-1!dummy-service-notification' for user 'dummy'
cluster-master-1-1  | [2022-03-03 17:22:33 +0100] information/Notification: Sending 'Problem' notification 'github-9207!problm-1!dummy-service-notification' for user 'dummy'
[2022-03-03T17:23:30+01:00] Starting master-2
Container cluster-master-2-1  Creating
Container cluster-master-2-1  Created
Container cluster-master-2-1  Starting
Container cluster-master-2-1  Started
[2022-03-03T17:24:30+01:00] master-2 version
v2.13.0-232-g90848f602
[2022-03-03T17:24:30+01:00] Stopping master-1
Container cluster-master-1-1  Stopping
Container cluster-master-1-1  Stopped
[2022-03-03T17:25:37+01:00] Bring services *-2 into final states
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!none-2'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!rec-2'."}]}
{"results":[{"code":200,"status":"Successfully processed check result for object 'github-9207!problm-2'."}]}
[2022-03-03T17:25:42+01:00] Cancel downtimes for services *-2
{"results":[{"code":200,"status":"Successfully removed all downtimes for object 'github-9207!none-2' and 0 child downtimes."}]}
{"results":[{"code":200,"status":"Successfully removed all downtimes for object 'github-9207!rec-2' and 0 child downtimes."}]}
{"results":[{"code":200,"status":"Successfully removed all downtimes for object 'github-9207!problm-2' and 0 child downtimes."}]}
[2022-03-03T17:26:43+01:00] Notification logs from master-2
cluster-master-2-1  | [2022-03-03 17:23:35 +0100] information/Notification: Sending 'DowntimeStart' notification 'github-9207!none-2!dummy-service-notification' for user 'dummy'
cluster-master-2-1  | [2022-03-03 17:23:35 +0100] information/Notification: Completed sending 'DowntimeStart' notification 'github-9207!none-2!dummy-service-notification' for checkable 'github-9207!none-2' and user 'dummy' using command 'dummy'.
cluster-master-2-1  | [2022-03-03 17:23:35 +0100] information/Notification: Sending 'DowntimeStart' notification 'github-9207!rec-2!dummy-service-notification' for user 'dummy'
cluster-master-2-1  | [2022-03-03 17:23:35 +0100] information/Notification: Completed sending 'DowntimeStart' notification 'github-9207!rec-2!dummy-service-notification' for checkable 'github-9207!rec-2' and user 'dummy' using command 'dummy'.
cluster-master-2-1  | [2022-03-03 17:23:35 +0100] information/Notification: Sending 'DowntimeStart' notification 'github-9207!problm-2!dummy-service-notification' for user 'dummy'
cluster-master-2-1  | [2022-03-03 17:23:35 +0100] information/Notification: Completed sending 'DowntimeStart' notification 'github-9207!problm-2!dummy-service-notification' for checkable 'github-9207!problm-2' and user 'dummy' using command 'dummy'.
cluster-master-2-1  | [2022-03-03 17:25:42 +0100] information/Notification: Sending 'DowntimeEnd' notification 'github-9207!none-2!dummy-service-notification' for user 'dummy'
cluster-master-2-1  | [2022-03-03 17:25:42 +0100] information/Notification: Completed sending 'DowntimeEnd' notification 'github-9207!none-2!dummy-service-notification' for checkable 'github-9207!none-2' and user 'dummy' using command 'dummy'.
cluster-master-2-1  | [2022-03-03 17:25:42 +0100] information/Notification: Sending 'DowntimeEnd' notification 'github-9207!rec-2!dummy-service-notification' for user 'dummy'
cluster-master-2-1  | [2022-03-03 17:25:42 +0100] information/Notification: Completed sending 'DowntimeEnd' notification 'github-9207!rec-2!dummy-service-notification' for checkable 'github-9207!rec-2' and user 'dummy' using command 'dummy'.
cluster-master-2-1  | [2022-03-03 17:25:43 +0100] information/Notification: Sending 'DowntimeEnd' notification 'github-9207!problm-2!dummy-service-notification' for user 'dummy'
cluster-master-2-1  | [2022-03-03 17:25:43 +0100] information/Notification: Completed sending 'DowntimeEnd' notification 'github-9207!problm-2!dummy-service-notification' for checkable 'github-9207!problm-2' and user 'dummy' using command 'dummy'.
cluster-master-2-1  | [2022-03-03 17:25:47 +0100] information/Notification: Sending 'Problem' notification 'github-9207!problm-2!dummy-service-notification' for user 'dummy'
cluster-master-2-1  | [2022-03-03 17:25:47 +0100] information/Notification: Sending 'Recovery' notification 'github-9207!rec-2!dummy-service-notification' for user 'dummy'
cluster-master-2-1  | [2022-03-03 17:25:47 +0100] information/Notification: Completed sending 'Problem' notification 'github-9207!problm-2!dummy-service-notification' for checkable 'github-9207!problm-2' and user 'dummy' using command 'dummy'.
cluster-master-2-1  | [2022-03-03 17:25:47 +0100] information/Notification: Completed sending 'Recovery' notification 'github-9207!rec-2!dummy-service-notification' for checkable 'github-9207!rec-2' and user 'dummy' using command 'dummy'.
[2022-03-03T17:26:43+01:00] Notification logs from master-2 (filtered)
cluster-master-2-1  | [2022-03-03 17:25:47 +0100] information/Notification: Sending 'Problem' notification 'github-9207!problm-2!dummy-service-notification' for user 'dummy'
cluster-master-2-1  | [2022-03-03 17:25:47 +0100] information/Notification: Sending 'Recovery' notification 'github-9207!rec-2!dummy-service-notification' for user 'dummy'
Container cluster-master-2-1  Stopping
Container cluster-master-2-1  Stopping
Container cluster-master-1-1  Stopping
Container cluster-master-1-1  Stopping
Container cluster-master-1-1  Stopped
Container cluster-master-1-1  Removing
Container cluster-master-1-1  Removed
Container cluster-master-2-1  Stopped
Container cluster-master-2-1  Removing
Container cluster-master-2-1  Removed
Network cluster_default  Removing
Network cluster_default  Removed

Note: during my tests, sometimes notifications were only requested but not actually sent. With notice log enabled, this then looked like this:

[2022-03-03 16:45:38 +0100] notice/Notification: Attempting to send notifications of type 'Recovery' for notification object 'github-9207
!rec-1!dummy-service-notification'.
[2022-03-03 16:45:38 +0100] notice/Notification: Attempting to send notifications of type 'Recovery' for notification object 'github-9207
!rec-1!dummy-service-notification'.
[2022-03-03 16:45:38 +0100] notice/Notification: Notification object 'github-9207!rec-1!dummy-service-notification': We did not notify user 'dummy' (Problem types enabled) for a problem before. Not sending Recovery notification.

So the new code was correctly requesting the notification but it got skipped later on. I believe this happens as the cluster was restarted between the initial problem notification was sent and the notification is triggered and some state is lost there.

fixes #9054
closes #9090
closes #9146

@cla-bot cla-bot bot added the cla/signed label Feb 4, 2022
@icinga-probot icinga-probot bot added bug Something isn't working ref/IP ref/NC labels Feb 4, 2022
@julianbrost julianbrost requested a review from Al2Klimov February 4, 2022 12:54
@julianbrost
Copy link
Contributor Author

Please do a first round of review so that we agree on the general change before I continue with addressing the remaining TODOs (some of which involve a discussion anyways).

@julianbrost julianbrost added this to the 2.14.0 milestone Feb 4, 2022
@@ -477,7 +477,7 @@ void Checkable::ProcessCheckResult(const CheckResult::Ptr& cr, const MessageOrig

if (send_notification && !is_flapping) {
if (!IsPaused()) {
if (suppress_notification) {
if (suppress_notification || (GetSuppressedNotifications() & (NotificationRecovery|NotificationProblem))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK... this looks like if suppression is over, but the suppressed haven't been fired yet, let's neutralize.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, not neutralize but rather keep the suppression ongoing until suppressed notifications are fired. This is needed because otherwise, a notification might be sent because there was a state change compared to the last state within the suppression period.

But I should probably add a comment here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do this...

@@ -494,12 +494,22 @@ void Checkable::ProcessCheckResult(const CheckResult::Ptr& cr, const MessageOrig
int suppressed_types_before (GetSuppressedNotifications());
int suppressed_types_after (suppressed_types_before | suppressed_types);

for (int conflict : {NotificationProblem | NotificationRecovery, NotificationFlappingStart | NotificationFlappingEnd}) {
/* E.g. problem and recovery notifications neutralize each other. */
int conflict = NotificationFlappingStart | NotificationFlappingEnd;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be const (for completeness).

if ((suppressed_types_after & conflict) == conflict) {
suppressed_types_after &= ~conflict;
}
int stateNotifications = NotificationRecovery | NotificationProblem;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be const (for completeness).

suppressed_types_after &= ~conflict;
}
int stateNotifications = NotificationRecovery | NotificationProblem;
if (!(suppressed_types_before & stateNotifications) && (suppressed_types & stateNotifications)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, stared long enough on this line to give up all this for now. We can continue this together. On a whiteboard.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've extended the PR description (the numbered list on top is new), maybe this helps to understand. This condition is the check if either of the two flags is set for the very first time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And... if none are set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What exactly do you mean? Set as in added to suppressed_notifications in general by this function? Or in some of the variables specifically?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean: to me this looks like intended as XOR, but implemented as NAND – one more trigger case. Or does the latter never happen?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's easier to read the conditions the other way around:

  • suppressed_types & stateNotifications: This means that the check result would have triggered a state notification (i.e. either a problem or a recovery notification) if it wasn't suppressed and therefore is stored for later.
  • !(suppressed_types_before & stateNotifications): This part checks that neither NotificationProblem nor NotificationRecovery were set before. So this is the very first time a state notification is generated during the current suppression period.

With the new logic, it isn't really necessary to differentiate between the two types of state notifications. All the code would also work if only a single "NotificationState" flag was added to suppressed_notifications. It's mainly done this way as the bitset and NotificationType enum already exist in their current form.

For this PR, you could basically ignore states where only one of NotificationProblem/NotificationRecovery is set and treat them as if both were set. The code in FireSuppressedNotifications will figure out which one is appropriate in the end, if either.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, now I get it! (!BEFORE) && AFTER

@julianbrost julianbrost force-pushed the bugfix/suppressed-state-notifications branch from 5709d9c to 3eb127d Compare February 9, 2022 15:44
Al2Klimov
Al2Klimov previously approved these changes Feb 21, 2022
Copy link
Member

@Al2Klimov Al2Klimov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now as we don't over-neutralise the bitmask, #9146 is definitely covered.

And due to the new attribute collateral damage is definitely avoided.

@@ -175,6 +175,9 @@ abstract class Checkable : CustomVarObject
[state, no_user_view, no_user_modify] int suppressed_notifications {
default {{{ return 0; }}}
};
[state, enum, no_user_view, no_user_modify] ServiceState state_before_suppression {
default {{{ return ServiceOK; }}}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This choice feels definitely right (implies #7815 a bit).

Comment on lines +242 to +263
return cr && !IsStateOK(cr->GetState()) && cr->GetState() != GetStateBeforeSuppression();
}
case NotificationRecovery:
{
auto cr (GetLastCheckResult());
return cr && IsStateOK(cr->GetState());
return cr && IsStateOK(cr->GetState()) && cr->GetState() != GetStateBeforeSuppression();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This respects that there's not only one problem state, good.

@@ -224,12 +239,12 @@ bool Checkable::NotificationReasonApplies(NotificationType type)
case NotificationProblem:
{
auto cr (GetLastCheckResult());
return cr && !IsStateOK(cr->GetState()) && GetStateType() == StateTypeHard;
return cr && !IsStateOK(cr->GetState()) && cr->GetState() != GetStateBeforeSuppression();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This keeps soft states...


continue;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and that suppresses them. Fine.

suppressed_types_after &= ~conflict;
}
int stateNotifications = NotificationRecovery | NotificationProblem;
if (!(suppressed_types_before & stateNotifications) && (suppressed_types & stateNotifications)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, now I get it! (!BEFORE) && AFTER

* WARNING -> OK -> CRITICAL generates both types once, but there should still be a notification.
*/
// TODO: Is there a better way to get this value than this case distinction?
SetStateBeforeSuppression(old_stateType == StateTypeHard ? old_state : ServiceOK);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what's the ?: for, but it definitely doesn’t harm.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need the last state before the current state that was a hard state. The TODO is there because I didn't find an existing variable or function that gives me this exact value, but I might have missed something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upon closer inspection, I think cr->GetPreviousHardState() should work there but doesn't because of #9254 (comment)

suppressed_types_after &= ~conflict;
}
const int stateNotifications = NotificationRecovery | NotificationProblem;
if (!(suppressed_types_before & stateNotifications) && (suppressed_types & stateNotifications)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, now problem|rec. don’t neutralise each other here but – I guess – a timer and the changed NotificationReasonApplies do this job.

@julianbrost julianbrost changed the title Checkable: send state notifications after suppression if and only it the state differs compared to before the suppression started Checkable: send state notifications after suppression if and only if the state differs compared to before the suppression started Feb 21, 2022
@julianbrost julianbrost force-pushed the bugfix/suppressed-state-notifications branch from 3eb127d to e9e7234 Compare February 23, 2022 11:19
@julianbrost
Copy link
Contributor Author

@Al2Klimov Before I do the final cleanup and testing, can you please provide some feedback on one specific question? Do you think it's fine to have Checkable::OnSuppressedNotificationsChanged implicitly sync the new state_before_suppression attribute as well as done in 4e97177 or should I rather add an extra function and signal similar to how I did in #9112?

@Al2Klimov
Copy link
Member

If you need to sync it at all (because it's not computed on the other node the same way due to $CHECKRESULT) I'd make one signal+handler per property.

@julianbrost
Copy link
Contributor Author

If you need to sync it at all (because it's not computed on the other node the same way due to $CHECKRESULT)

Yes, setting the attribute depends on the other notification logic in ProcessCheckResult that's within if (!IsPaused()).

I'd make one signal+handler per property.

So that there are two JSON-RPC messages sent when state_before_suppression is set initially?

@julianbrost julianbrost force-pushed the bugfix/suppressed-state-notifications branch from e9e7234 to a2bd87d Compare February 23, 2022 13:27
@Al2Klimov
Copy link
Member

Yes, but that's just my personal preference.

@julianbrost julianbrost force-pushed the bugfix/suppressed-state-notifications branch 5 times, most recently from e4514ff to 3280440 Compare February 25, 2022 09:22
@julianbrost julianbrost marked this pull request as ready for review February 28, 2022 16:25
lib/icinga/clusterevents.cpp Show resolved Hide resolved
test/icinga-checkresult.cpp Outdated Show resolved Hide resolved
test/icinga-checkresult.cpp Outdated Show resolved Hide resolved
test/icinga-checkresult.cpp Outdated Show resolved Hide resolved
test/icinga-checkresult.cpp Outdated Show resolved Hide resolved
test/icinga-checkresult.cpp Outdated Show resolved Hide resolved
test/icinga-checkresult.cpp Outdated Show resolved Hide resolved
@julianbrost julianbrost force-pushed the bugfix/suppressed-state-notifications branch 2 times, most recently from 48dc544 to a74b101 Compare March 1, 2022 15:42
@julianbrost julianbrost requested a review from Al2Klimov March 1, 2022 15:43
Al2Klimov
Al2Klimov previously approved these changes Mar 1, 2022
@julianbrost
Copy link
Contributor Author

julianbrost commented Mar 2, 2022

https://github.com/Icinga/icinga2/runs/5377835600?check_suite_focus=true#step:7:47590

./test/icinga-checkresult.cpp(957): error: in "icinga_checkresult/suppressed_notification": check service->GetStateBeforeSuppression() == initialState has failed [0 != 3]

But only on Raspbian bullseye. I'll see if I'm able to reproduce this. But I don't think that has happened on previous versions of the PR on GH Actions and also ProcessCheckResult really should behave the same on all architectues. Not sure what to do with this when I can't reproduce it though :\

@julianbrost
Copy link
Contributor Author

There are data races between Checkable::ProcessCheckResult() and Checkable::FireSuppressedNotifications(). Which only shows in the tests if this test case is running for more than 5 seconds, that's why it only shows on the slow emulated Raspbian builds.

Reducing the timer interval allows reproducing the issue on other platforms:

87: Running 1 test case...
87: Test case: volatile=0 checkAttempts=1 sequence={OK OK OK OK OK}
87: [2022-03-02 10:41:57 +0000] information/Timer: 0x555c5cf8a980 Start()
87:  0# icinga::Timer::Start() in /home/jbrost/dev/icinga2/cmake-build-debug/Bin/Debug/boosttest-test-base
[...]
87:  4# icinga::Checkable::Start(bool) in /home/jbrost/dev/icinga2/cmake-build-debug/Bin/Debug/boosttest-test-base
[...]
87: [2022-03-02 10:41:57 +0000] information/Timer: 0x555c5cf62330 Start()
87:  0# icinga::Timer::Start() in /home/jbrost/dev/icinga2/cmake-build-debug/Bin/Debug/boosttest-test-base
[...]
87:  4# icinga::Checkable::Start(bool) in /home/jbrost/dev/icinga2/cmake-build-debug/Bin/Debug/boosttest-test-base
[...]
87: Test case: volatile=1 checkAttempts=2 sequence={WARNING OK WARNING UNKNOWN UNKNOWN}
87:   ProcessCheckResult(WARNING)
87:   ProcessCheckResult(WARNING)
87:   Downtime Start
87: [2022-03-02 10:41:59 +0000] information/Downtime: Triggering downtime 'downtime' for checkable 'service'.
87:   ProcessCheckResult(OK)
87:   ProcessCheckResult(WARNING)
87:   ProcessCheckResult(UNKNOWN)
87: [2022-03-02 10:41:59 +0000] information/Timer: 0x555c5cf8a980 Call()
87:   ProcessCheckResult(UNKNOWN)
87: /home/jbrost/dev/icinga2/test/icinga-checkresult.cpp(957): error: in "icinga_checkresult/suppressed_notification": check service->GetStateBeforeSuppression() == initialState has failed [0 != 1]
87:   Downtime End
87:   FireSuppressedNotifications()
87:   -> OnNotificationsRequested(Problem, UNKNOWN)

(The timer started first in Checkable::Start is Checkable::FireSuppressedNotificationsTimer.)

@julianbrost julianbrost marked this pull request as draft March 2, 2022 12:12
@julianbrost
Copy link
Contributor Author

Apart from the race condition, there's also a small logic error: The following else may still clear problem/recovery notifications from the suppressed notifications while the suppression reason is still ongoing (which should no longer happen with this PR):

if (still_applies) {
if (!NotificationReasonSuppressed(type) && !IsLikelyToBeCheckedSoon() && !wasLastParentRecoveryRecent.Get()) {
Checkable::OnNotificationsRequested(this, type, GetLastCheckResult(), "", "", nullptr);
subtract |= type;
}
} else {
subtract |= type;
}

This commit changes the Checkable notification suppression logic (notifications
are currently suppressed on the Checkable if it is unreachable, in a downtime,
or acknowledged) to that after the suppression reason ends, a state
notification is sent if and only if the first hard state after is different
from the last hard state from before. If the checkable is in a soft state after
the suppression ends, the notification is further suppressed until a hard state
is reached.

To achieve this behavior, a new attribute state_before_suppression is added to
Checkable. This attribute is set to the last hard state the first time either a
PROBLEM or a RECOVERY notification is suppressed. Compared to from before,
neither of these two flags in the suppressed_notification will ever be cleared
while the supression is still ongoing but only after the suppression ended and
the current state is compared with the old state stored in
state_before_suppression.
This ensures that in case of a failover in an HA zone, the other can take over
properly and has the required state to send the proper notifications.
@julianbrost julianbrost force-pushed the bugfix/suppressed-state-notifications branch from a74b101 to 90848f6 Compare March 3, 2022 13:26
@julianbrost julianbrost marked this pull request as ready for review March 3, 2022 16:36
@julianbrost julianbrost requested a review from Al2Klimov March 3, 2022 16:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backported Fix was included in a bugfix release bug Something isn't working cla/signed ref/IP ref/NC
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Missing recovery notification after a downtime with new hard state change
2 participants