
Failed alerts retry forever #21431

Open
Joel-Duffield-Graylog opened this issue Jan 23, 2025 · 1 comment

Expected Behavior

Failed alerts should not retry indefinitely; retries should stop after some bound and/or the failure should be surfaced to the user.

Current Behavior

When an alert fails to run (because of a timeout, for example), it is retried after 5 seconds, and if it keeps failing it apparently stays in that retry loop forever. If the underlying query is large, it can then cause other alerts to start timing out, and the problem snowballs.
The failure is logged in server.log, but nothing tells the user this is happening, so if it were a critical alert you would have no idea it was no longer working properly, and the failure could go unnoticed for days.

Possible Solution

There should be a way to control this behavior, ideally at the individual alert level, since some alerts warrant more retries than others depending on criticality, or at least a globally configurable maximum number of retries. A sketch of what a bounded retry policy could look like is shown below.
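For illustration only, here is a minimal sketch of such a bounded, per-alert retry policy. The class and member names (`BoundedRetryPolicy`, `maxRetries`, `retryDelay`) are hypothetical and not Graylog code; the 5-second delay is taken from the behavior described above.

```java
// Hypothetical sketch, not Graylog code: illustrates a per-alert retry cap
// with a fixed delay between attempts (5 seconds, as described above).
import java.time.Duration;
import java.util.concurrent.Callable;

public class BoundedRetryPolicy {

    private final int maxRetries;      // per-alert cap (or a global default)
    private final Duration retryDelay; // delay between attempts

    public BoundedRetryPolicy(int maxRetries, Duration retryDelay) {
        this.maxRetries = maxRetries;
        this.retryDelay = retryDelay;
    }

    // Runs the alert query, retrying up to maxRetries times before giving up
    // instead of looping forever.
    public <T> T execute(Callable<T> alertQuery) throws Exception {
        Exception lastFailure = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return alertQuery.call();
            } catch (Exception e) {
                lastFailure = e;
                if (attempt < maxRetries) {
                    Thread.sleep(retryDelay.toMillis());
                }
            }
        }
        // Giving up here is also the natural place to raise a user-visible
        // notification, addressing the "nothing tells the user" part above.
        throw new IllegalStateException("Alert still failing after " + maxRetries + " retries", lastFailure);
    }

    public static void main(String[] args) throws Exception {
        BoundedRetryPolicy policy = new BoundedRetryPolicy(3, Duration.ofSeconds(5));
        System.out.println(policy.execute(() -> "query result"));
    }
}
```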

Steps to Reproduce (for bugs)

Context

Your Environment

  • Graylog Version: 6.1.5
  • Java Version:
  • OpenSearch Version:
  • MongoDB Version:
  • Operating System:
  • Browser version:

tellistone commented Jan 23, 2025

As discussed, the first suspicion here is that the problem is upstream (why are the jobs running slowly or timing out in the first place).

I will reach out to this user next week to investigate their config before we proceed with this as a potential bug, e.g. to check:

  • is their index set shard count rational
  • is their index set strategy rational
  • is their shard size in the realm of 0.6x to 0.7x OS node RAM
  • is their shard count within the realm of what their OS cluster can cope with (a quick way to pull these numbers is sketched below, after the lists)

and look at their alerts:

  • are their alerts configured to run against specific streams, or are they hitting all indexes at once?
  • do any alerts contain ruinous regex queries?
  • how often do their bespoke alerts run? The UI lets users configure them to run every second, should they wish to let chaos take the world.
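As a rough way to pull those shard numbers, something like the sketch below could be used. The endpoint URL is an assumption (adjust host, port, and auth for the actual cluster); it uses the standard `_cat/indices` API rather than anything Graylog-specific.

```java
// Diagnostic sketch: dump per-index primary/replica shard counts and store
// sizes from OpenSearch so they can be compared against node RAM and
// overall cluster capacity. The endpoint below is an assumed local default.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ShardReport {
    public static void main(String[] args) throws Exception {
        String url = "http://localhost:9200/_cat/indices?v&h=index,pri,rep,store.size&s=store.size:desc";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        // Columns: index name, primary shards, replicas, total store size.
        System.out.println(response.body());
    }
}
```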
