
Activator health checks #15575

Open
thorweijie opened this issue Oct 16, 2024 · 2 comments
Labels
kind/question Further information is requested lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@thorweijie

Ask your question here:

We have a Kubernetes cluster with many inference services. After all the inference services were restarted, we noticed that the istio-proxy containers in the activator pods had high CPU usage and health checks were failing with response code 0, so we set target burst capacity to 0 to bypass the activator, which fixed the issue. However, despite being bypassed, the activator pods kept attempting health checks (still failing with response code 0) until they were restarted. We would like to know whether the health checks performed by the activator are cached, and whether their frequency can be configured.
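For reference, target burst capacity can be set to 0 per revision via an annotation on the Knative Service's revision template (the service name below is hypothetical):

```yaml
# Per-revision setting: a target burst capacity of 0 takes the
# activator off the data path once capacity is sufficient.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: example-inference-service   # hypothetical name
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/target-burst-capacity: "0"
```

It can also be set cluster-wide through the `target-burst-capacity` key in the `config-autoscaler` ConfigMap.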

@thorweijie thorweijie added the kind/question Further information is requested label Oct 16, 2024
@skonto
Contributor

skonto commented Oct 23, 2024

Hi @thorweijie!

After all the inference services were restarted, we noticed the istio-proxy container in activator pods were having high cpu usage and health checks were failing with response code 0

Which health checks were failing, the activator's own?

We noticed that despite being skipped, the activator pods were still trying to perform health checks with response code 0 until they were restarted. We would like to know if the health checks for activator are cached, and whether the frequency of the health checks can be configured?

The probing mechanism starts when endpoints are created or updated, with a default frequency of 200ms.
If probing finishes successfully, you should see a message like the following, assuming you have enabled debug logging for the activator:

{"severity":"DEBUG","timestamp":"2024-10-23T14:20:52.082125337Z","logger":"activator","caller":"net/revision_backends.go:348","message":"Done probing, got 1 healthy pods","commit":"0abee66","knative.dev/controller":"activator","knative.dev/pod":"activator-8675c9944c-mdfj9","knative.dev/key":"default/autoscale-go-00001"}
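For context, the activator's log level is controlled by the `config-logging` ConfigMap in the `knative-serving` namespace; a minimal fragment to turn on debug logging:

```yaml
# Set the activator's log level to debug so probe-completion
# messages like the one above become visible.
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-logging
  namespace: knative-serving
data:
  loglevel.activator: "debug"
```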

Once all pods are ready (and stay that way), probing should stop. The idea is that the activator is on standby to handle traffic, so each activator instance needs to know the ready targets in order to route traffic to them if needed.
AFAIK there is no caching. Maybe @ReToCode or @dprotaso have more to say here.


This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2025