
Failed alerts retry forever #21431

Open
Joel-Duffield-Graylog opened this issue Jan 23, 2025 · 1 comment

Expected Behavior

Failed alerts should not retry indefinitely; retries should stop after some bound and/or the failure should be surfaced to the user.

Current Behavior

When an alert fails to run (because of a timeout, for example), it is retried after 5 seconds, and if it keeps failing it apparently stays in that retry loop forever. If the underlying query is large, it can then cause other alerts to start timing out, and the problem snowballs.
The failure is logged in server.log, but nothing tells the user this is happening, so if it were a critical alert you would have no idea it was no longer working properly, and the failure could go unnoticed for days.

Possible Solution

There should be a way to control this behavior, ideally at the individual alert level, since some alerts warrant more retries than others depending on criticality, or at least a globally configurable maximum number of retries. A sketch of what a bounded retry policy could look like is shown below.
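For illustration only, here is a minimal sketch of such a bounded, per-alert retry policy. The class and member names (`BoundedRetryPolicy`, `maxRetries`, `retryDelay`) are hypothetical and not Graylog code; the 5-second delay is taken from the behavior described above.

```java
// Hypothetical sketch, not Graylog code: illustrates a per-alert retry cap
// with a fixed delay between attempts (5 seconds, as described above).
import java.time.Duration;
import java.util.concurrent.Callable;

public class BoundedRetryPolicy {

    private final int maxRetries;      // per-alert cap (or a global default)
    private final Duration retryDelay; // delay between attempts

    public BoundedRetryPolicy(int maxRetries, Duration retryDelay) {
        this.maxRetries = maxRetries;
        this.retryDelay = retryDelay;
    }

    // Runs the alert query, retrying up to maxRetries times before giving up
    // instead of looping forever.
    public <T> T execute(Callable<T> alertQuery) throws Exception {
        Exception lastFailure = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return alertQuery.call();
            } catch (Exception e) {
                lastFailure = e;
                if (attempt < maxRetries) {
                    Thread.sleep(retryDelay.toMillis());
                }
            }
        }
        // Giving up here is also the natural place to raise a user-visible
        // notification, addressing the "nothing tells the user" part above.
        throw new IllegalStateException("Alert still failing after " + maxRetries + " retries", lastFailure);
    }

    public static void main(String[] args) throws Exception {
        BoundedRetryPolicy policy = new BoundedRetryPolicy(3, Duration.ofSeconds(5));
        System.out.println(policy.execute(() -> "query result"));
    }
}
```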

Steps to Reproduce (for bugs)

Context

Your Environment

  • Graylog Version: 6.1.5
  • Java Version:
  • OpenSearch Version:
  • MongoDB Version:
  • Operating System:
  • Browser version:

tellistone commented Jan 23, 2025

As discussed, the first suspicion here is that the problem is upstream (why are the jobs running slowly or timing out in the first place).

I will reach out to this user next week to investigate their config before we proceed with this as a potential bug, e.g. to check:

  • is their index set shard count rational
  • is their index set strategy rational
  • is their shard size in the realm of 0.6x to 0.7x OS node RAM
  • is their shard count within the realm of what their OS cluster can cope with (a quick way to pull these numbers is sketched below, after the lists)

and look at their alerts:

  • are their alerts configured to run against specific streams, or are they hitting all indexes at once?
  • do any alerts contain ruinous regex queries?
  • how often do their bespoke alerts run? The UI lets users configure them to run every second, should they wish to let chaos take the world.
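As a rough way to pull those shard numbers, something like the sketch below could be used. The endpoint URL is an assumption (adjust host, port, and auth for the actual cluster); it uses the standard `_cat/indices` API rather than anything Graylog-specific.

```java
// Diagnostic sketch: dump per-index primary/replica shard counts and store
// sizes from OpenSearch so they can be compared against node RAM and
// overall cluster capacity. The endpoint below is an assumed local default.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ShardReport {
    public static void main(String[] args) throws Exception {
        String url = "http://localhost:9200/_cat/indices?v&h=index,pri,rep,store.size&s=store.size:desc";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        // Columns: index name, primary shards, replicas, total store size.
        System.out.println(response.body());
    }
}
```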
