Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Let redundancy groups not just fail #10190

Open
nilmerg opened this issue Oct 16, 2024 · 5 comments · May be fixed by #10290
Open

Let redundancy groups not just fail #10190

nilmerg opened this issue Oct 16, 2024 · 5 comments · May be fixed by #10290
Assignees
Labels
area/runtime Downtimes, comments, dependencies, events

Comments

@nilmerg
Copy link
Member

nilmerg commented Oct 16, 2024

This is not a bug, not yet, at least. Once #10014 is fixed, this should be considered.

tl/dr

Redundancy groups should not just fail, they should in addition be not reachable or not determinable, just like hosts and services are.

What's this about?

graph LR;
    ChildHost-->pa["ParentHostA (Group 1)"];
    ChildHost-->pb["ParentHostB (Group 1)"];
    pa-->ga["GrandParentA"];
    pb-->gb["GrandParentB"];
Loading

Right now, the child host will be unreachable in two cases:

  1. Both parents are down (expected)
  2. One of them is unreachable (The linked bug)

Once the second case is fixed, it will likely behave like this instead:

  1. Both parents are down (still, expected)
  2. Both parents are unreachable (new, expected)

But what happens in case only one of the parents is down and the other unreachable?

  1. One parent is down and the other unreachable (new, ??)

I suppose so, since already right now, the availability of the child host is influenced by its parent's reachable state. It will just be extended to consider redundancy groups in such a case.

But then there is Dependency::IsAvailable

Which is what is used to determine a redundancy group's state at the moment. Though, it doesn't consider the parent's reachable state at all. The parent might be UP, but unreachable, and it returns true.

So, in order to really ensure that the child host is unreachable in the third case, this might need to change. (The potential bug)

Of course, this has larger side-effects, as it changes how dependencies work in general.

Which is what this issue is really about! As I believe this is a good thing. Why shouldn't a dependency fail, in case another one higher up in the hierarchy fails? This is already the case, I think, anyway. This discrepancy will only pop up in case checks are disabled on the parent in question. (As if checks run, the parent will eventually go down/unknown)

Additional note

What led me to this, was the wish to understand how a redundancy group's state will be determined as part of Icinga/icingadb#347.

As of today I thought it simply represents whether all related dependencies failed or not. But now a failed dependency might be failing, because a parent might just not be reachable. And so will a redundancy group.

This all sounds fine, unless you want to identify the actual root problem in a failing dependency chain. (A root problem is a dependency node, which is reachable and has a problem)

In a dependency chain, the current plan is, to represent redundancy groups as actual nodes with an actual state. While hosts and services also have state, they can in addition be not reachable. Redundancy groups can only fail.

But a redundancy group, with one host member that is UP but not reachable, is technically also not reachable. So it's not just a failing dependency, but one that is … not determinable?

@yhabteab
Copy link
Member

yhabteab commented Nov 13, 2024

Which is what this issue is really about!

To be honest, I don't quite understand what this issue is all about :). You have listed a number of open questions that don't quite fit the title of the issue.

Redundancy groups should not just fail, they should in addition be not reachable or not determinable, just like hosts and services are.

This particular case might be covered by #10227 as that issue is about syncing dependencies to Redis.

But what happens in case only one of the parents is down and the other unreachable?

Which is what is used to determine a redundancy group's state at the moment. Though, it doesn't consider the parent's reachable state at all. The parent might be UP, but unreachable, and it returns true.

So, in order to really ensure that the child host is unreachable in the third case, this might need to change. (The potential bug)

The Dependency::IsAvailable() method is concerned with verifying the availability of a given dependency, rather than determining the reachability of its direct or intermediate parents. With #10228, the child host will of course at least for me be unreachable. It is not entirely clear to me, however, why this particular case should be treated differently from the others.

I suppose so, since already right now, the availability of the child host is influenced by its parent's reachable state. It will just be extended to consider redundancy groups in such a case.

Isn't that what #10014 is about? That's a confirmed bug that will be fixed by #10228, but I don't see how that applies to this issue.

Why shouldn't a dependency fail, in case another one higher up in the hierarchy fails? This is already the case, I think, anyway.

This is already the case, all parents in a given dependency graph are evaluated recursively, but no redundancy groups have been taken into account so far.

But a redundancy group, with one host member that is UP but not reachable, is technically also not reachable. So it's not just a failing dependency, but one that is … not determinable?

Well, it's simply unreachable. Isn't the unreachable state intended for use cases like this? There's no need to add another meaning to it.

@nilmerg
Copy link
Member Author

nilmerg commented Nov 13, 2024

The main point of this issue is that redundancy groups don't have a reachability flag in the current design proposal. They need one however, as Web needs it to identify root problems. #10227 does not include this yet, as Icinga doesn't even consider it.

#10228 does exactly miss the point I made here and introduces contradictory states: A host or service might now be unreachable, although every related dependency states it is available.

I don't know how #10227 determines a group's state at the moment, but let's assume it's a loop over all related dependencies which are asked if they're available. Which they all are.

And in this case, this contradiction is written to redis and subsequently to the database where Web will have a hard time following a hierarchy up to the actual root problem as when it reaches the group in question it stops, as it isn't failed. And of course, it didn't fail, it just doesn't have any member that is reachable.

And now let's assume every member of a redundancy group is itself dependent on another (grand) parent, just like in my example. The group only fails (and thus states the child isn't reachable) if all members are either unreachable themselves or in such a state that the dependency fails. But what if one of the members isn't in such a state that the dependency fails but it's (grand) parent is in such a state that their dependency fails? The answer: Nothing, unless the parent is checked. Which it won't if the parent <-> grandparent dependency disables checks.

--

I know this is a culmination of several issues. But I hope I made it clear now what I'm up to.

@yhabteab
Copy link
Member

The main point of this issue is that redundancy groups don't have a reachability flag in the current design proposal. They need one however, as Web needs it to identify root problems. #10227 does not include this yet, as Icinga doesn't even consider it.

Well, redundancy groups simply don't exist from Icinga 2's point of view, or at least they're not one of those things that can have states and I don't think it would make sense to change that. If in doubt, #10227, if I understood @oxzi correctly, will construct the redundancy groups virtually and sync them to Redis. So I definitely think this is something that should be fixed with or based on #10227's codebase.

#10228 does exactly miss the point I made here and introduces contradictory states: A host or service might now be unreachable, although every related dependency states it is available.

With #10228, a host or service becomes unreachable when any of its parent dependencies become unreachable, so I don't know why this contradicts the point you made. After all...

Why shouldn't a dependency fail, in case another one higher up in the hierarchy fails? This is already the case, I think, anyway.

... isn't the PR doing exactly what you've asked for here?

I don't know how #10227 determines a group's state at the moment, but let's assume it's a loop over all related dependencies which are asked if they're available. Which they all are.

This can only be answered by @oxzi. Since Icinga/icingadb#347 (comment) have listed all the necessary database tables, I guess this needs to be done or maybe is already in progress with #10227.

And in this case, this contradiction is written to redis and subsequently to the database where Web will have a hard time following a hierarchy up to the actual root problem as when it reaches the group in question it stops, as it isn't failed. And of course, it didn't fail, it just doesn't have any member that is reachable.

That sounds to me like a problem that might occur in the future, not something that already exists. There isn't even a corresponding PR for #10227 yet, so there doesn't seem to be any point in discussing it here now. Originally I just wanted to understand this issue and see if I could fix it in parallel with #10227, but these two are mutually dependent.

But what if one of the members isn't in such a state that the dependency fails but it's (grand) parent is in such a state that their dependency fails? The answer: Nothing, unless the parent is checked. Which it won't if the parent <-> grandparent dependency disables checks.

If I understand your twisted question correctly :), this seems to be related to #10143.

@nilmerg
Copy link
Member Author

nilmerg commented Nov 19, 2024

So I definitely think this is something that should be fixed with or based on #10227 codebase.

✅ 😅

With #10228, a host or service becomes unreachable when any of its parent dependencies become unreachable, so I don't know why this contradicts the point you made. After all...

reachable is a state which a child is in, available is what a parent can be or not. There are two distinct objects, which one's reachability depends on the others availability. Even with #10228, a child becomes not reachable while the parent is still available.

... isn't the PR doing exactly what you've asked for here?

No, it doesn't touch anything related to redundancy groups. And it cannot either, since they don't exist yet, as you pointed out. What I ask here cannot be fixed by a single change anyway, I just wanted to outline my theory that there is something wrong with how reachability and availability are currently determined.

That sounds to me like a problem that might occur in the future, not something that already exists.

It certainly will, if one relies on what Dependency::IsAvailable states :P

If I understand your twisted question correctly :), this seems to be related to #10143.

Yes…

I know this is a culmination of several issues. But I hope I made it clear now what I'm up to.

@nilmerg
Copy link
Member Author

nilmerg commented Nov 28, 2024

Recently, we discussed this in person and I wasn't so sure anymore whether we required failed and is_reachable, or just one of them. Now I'am sure, we need both. Let me explain:

Note that, as always, implicit relations (host -> service) are of no importance here.

graph TD;
    A-->B;
    B-->D;
    C-->D;
    subgraph BC [ redundancy group ]
    B~~~C
   end
Loading

The group is failed if B and C are reachable and cause D to be not reachable.
The group is not reachable if either B or C are as well not reachable. In this example only B cannot.

#10158 talks of affects_children for host and service states. This boolean is very similar to my definition of a group's failed state. Both are only true, if the state can be reliably determined and is responsible for any children to be not reachable. (i.e. the object in question is reachable)

If affects_children or failed is true, is_reachable can by definition then also only be true. On the other hand, if is_reachable is false, affects_children and failed cannot be determined and hence cannot be relied upon.

Now, the root problem of say D, is the closest parent that is reachable and responsible. (affects_children=true, failed=true)

graph TD;
    A[down]-->B;
    B["up (unreachable)"]-->D;
    C[down]-->D["down (unreachable)"];
    subgraph BC [ unreachable ]
    B~~~C
   end
Loading

In this case D is not reachable and A is clearly the root problem.

@yhabteab yhabteab linked a pull request Jan 13, 2025 that will close this issue
@yhabteab yhabteab added the area/runtime Downtimes, comments, dependencies, events label Jan 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/runtime Downtimes, comments, dependencies, events
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants