Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Admission control threat model. #2

Merged
merged 1 commit into from
May 24, 2022
Merged

Admission control threat model. #2

merged 1 commit into from
May 24, 2022

Conversation

jvanz
Copy link
Member

@jvanz jvanz commented Apr 18, 2022

Adds a RFC to discuss the admission control threat model and its mitigations.

Fix kubewarden/kubewarden-controller#129

Copy link
Contributor

@raulcabello raulcabello left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @jvanz ! I just left some minor comments

timeouts as the admission controller uses compute power to process the workloads

**Mitigation**
Webhook fail closed and authenticate callers. We are safe, Kubewarden already does that.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just to mention that someone could change failurePolicy to ignore. But by default is fail so I agree that we are safe here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as in Threat 1.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also in this case, I think the only reasonable way to mitigate this issue is to introduce allow each policy to run for at most X seconds. Once the timeout is reached the evaluation response will be a rejection accompanied by a certain error code and message.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, users with permissions can change that. This is also mapped in threat #4. But I still believe we should keep failing closed. This is a good practice to avoid bad actors from deploying malicious apps during a DOS. Even if we add some timeout, given enough load, that may not be sufficient. Furthermore, as far as I can remember, users can change this behavior with helm values.


**Mitigation**
All rules are reviewed and tested.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pod-privileged-policy can also be used here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @raulcabello mentions, if an explicit pod-privileged-policy is installed (validating only), then this cannot happen. Regardless of whatever the mutating policy did, the validating policy has to approve the request.

This is a harder situation for example if a user deploys a policy that does mutation and validation in the same policy (e.g. to save time reducing requests made by the api server). If a policy does many things, then it's up to the policy author to make sure that this kind of situation does not happen.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still, it sounds like a gotcha that deserves a paragraph in the policy authors docs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If our "PSP best practices" selection deploys the pod-privileged-policy, this is not going to be an issue. Even if the user provides a mutating policy that by design/mistake makes a container privileged.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the "PSP best practices" includes the pod-privileged-policy.

Copy link
Member

@ereslibre ereslibre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jvanz!


Webhook fails closed. In other words, if the webhook does not respond in time,
for any reasion, API server should reject the request. We are safe on this.
Kubewarden default behavior already does that.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the policy-server service is reachable from within the cluster, it would be possible to attempt a denial of service by accessing the service from within the cluster, by requesting a big amount of requests.

If the policy is configured to fail open (failurePolicy: ignore), then policies will have no effects if the endpoint is not reachable. If the policy is configured to fail closed (failuePolicy: fail), then it can cause a denial of service if the webhooks are not reachable.

Network policies would mitigate this threat.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the only reasonable way to mitigate this issue is to introduce allow each policy to run for at most X seconds. Once the timeout is reached the evaluation response will be a rejection accompanied by a certain error code and message.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep failing closed. This is a good way to avoid bad actors from deploying malicious apps during a DOS.

timeouts as the admission controller uses compute power to process the workloads

**Mitigation**
Webhook fail closed and authenticate callers. We are safe, Kubewarden already does that.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as in Threat 1.

**Mitigation**
N/A

I cannot see how Kubewarden can help here.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is also not very clear in the original description "via a number of mechanisms.".

**To do**
Kubewarden should implement mutual TLS authentication
We can add in the recommended policies from the `kubewarden-defaults` Helm
chart a policy to drop the `NET_RAW` capability.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NET_RAW containers will always exist though. CNI plugins are a typical example.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose then drop NET_RAW capability besides in a rejectlist ns (control plane).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can prevent the creation of containers that have this capability via this policy. Is that one of the policies we enable via our "PSP best practices" selection?

Copy link
Member Author

@jvanz jvanz Apr 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's what I had in mind. Adds the capabilities-psp-policy in the recommended policies skipping the control plane namespaces already listed in the chart.


**Mitigation**
All rules are reviewed and tested.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @raulcabello mentions, if an explicit pod-privileged-policy is installed (validating only), then this cannot happen. Regardless of whatever the mutating policy did, the validating policy has to approve the request.

This is a harder situation for example if a user deploys a policy that does mutation and validation in the same policy (e.g. to save time reducing requests made by the api server). If a policy does many things, then it's up to the policy author to make sure that this kind of situation does not happen.

**To do**
I think most of the RBAC is not responsability. But we can help our users if we:
- Warning them in our docs and *suggest* some minimum RBAC to be used.
- Provide a policy which detect RBAC changes and **maybe** block them. Is this possible?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think is a good practice to deploy ClusterAdmissionPolicy policies with a reject list by default. This ensures that these policies are enforced everywhere except for a well known list of namespaces where they are not enforced.

This makes Threat 11 harder to happen.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree here.
I wonder if we could provide that rejectlist in a configmap deployed via kubewarden-defaults, that all ClusterAdmissionPolicy can use. We could even configure the default policy-server to apply that rejectlist for all ClusterAdmissionPolicies that users create, leaving the control plane and Kubewarden undisturbed.

Comment on lines +193 to +191
version not cover by the policy. We should warning policies developers to cover
all the supported API version in theirs tests and reject all of others.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is tricky. Imagine a policy already built some time ago. Now the API server adds support for ephemeral containers, but this policy only checks init and regular containers: there's no way in which we could have warned the policy author when they wrote the policy some time ago: the Kubernetes API didn't have ephemeral containers.

I think the best we can do is document in our blog when this kind of API addition requires to review existing policies in common API's.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can add metadata to policies, so authors are explicitly stating which bracket of API version they cover against. We as authors can review and bump as needed, but yep, it's a cat and mouse game.

I suppose we either cover all of K8s API for all supported versions with policies. Or we create a policy that rejects all k8s objects, besides a very specific API allowlist.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's nice to allow our user to choose if they want to reject requests from unexpected API version or just accept the requests. As @viccuad said, this can be done using policies. Which is nice, because we do not need to change the core of the project and still give an option for our users to protect themselves from this.

I need to check, but if the context aware policies have read access to the Kuberwarden objects, they can read the mapped API version dynamically and rejecting requests of API version that are not listed there.

Comment on lines 207 to 209
- We may need to define all the API versions cover by the recommended policies
- We can create a configuration to reject by default requests where the API version not cover by the policy.
- We should warning policies developers to cover all the supported API version in theirs tests and reject all of others.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These don't apply here. Leftover?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We provide policies that can help with this and that can be deployed in the default policy-server namespace, where all policy server deployments are created.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with what rafa wrote

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, copy and past issue. 🤦‍♂️

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Furthermore, our policies cannot help on this. The threat said container deployed in the cluster node. Kubewarden does not have access to this information. Image a bad actors with access to the node running the controller, we cannot block them from running a privileged container in the node.

rfc/0006-threat-model.md Outdated Show resolved Hide resolved
rfc/0006-threat-model.md Outdated Show resolved Hide resolved
rfc/0006-threat-model.md Outdated Show resolved Hide resolved
**To do**
Kubewarden should implement mutual TLS authentication
We can add in the recommended policies from the `kubewarden-defaults` Helm
chart a policy to drop the `NET_RAW` capability.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose then drop NET_RAW capability besides in a rejectlist ns (control plane).


**Mitigation**
All rules are reviewed and tested.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still, it sounds like a gotcha that deserves a paragraph in the policy authors docs.

**To do**
I think most of the RBAC is not responsability. But we can help our users if we:
- Warning them in our docs and *suggest* some minimum RBAC to be used.
- Provide a policy which detect RBAC changes and **maybe** block them. Is this possible?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree here.
I wonder if we could provide that rejectlist in a configmap deployed via kubewarden-defaults, that all ClusterAdmissionPolicy can use. We could even configure the default policy-server to apply that rejectlist for all ClusterAdmissionPolicies that users create, leaving the control plane and Kubewarden undisturbed.

rfc/0006-threat-model.md Show resolved Hide resolved
Comment on lines +193 to +191
version not cover by the policy. We should warning policies developers to cover
all the supported API version in theirs tests and reject all of others.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can add metadata to policies, so authors are explicitly stating which bracket of API version they cover against. We as authors can review and bump as needed, but yep, it's a cat and mouse game.

I suppose we either cover all of K8s API for all supported versions with policies. Or we create a policy that rejects all k8s objects, besides a very specific API allowlist.

Copy link
Member

@viccuad viccuad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks :). Left some comments on the ideas, which I suppose would go into default policies, yet for me the RFC is ok.


Webhook fails closed. In other words, if the webhook does not respond in time,
for any reasion, API server should reject the request. We are safe on this.
Kubewarden default behavior already does that.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the only reasonable way to mitigate this issue is to introduce allow each policy to run for at most X seconds. Once the timeout is reached the evaluation response will be a rejection accompanied by a certain error code and message.

timeouts as the admission controller uses compute power to process the workloads

**Mitigation**
Webhook fail closed and authenticate callers. We are safe, Kubewarden already does that.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also in this case, I think the only reasonable way to mitigate this issue is to introduce allow each policy to run for at most X seconds. Once the timeout is reached the evaluation response will be a rejection accompanied by a certain error code and message.

Comment on lines 69 to 71
We may provide some mechanism to notify operators when a webhook is changed.
Maybe the controller can watch its webhooks, fix when the configuration is
changed and notify the operator.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have a quite a list of things to do, I would not give the impression we're going to do something, unless there's a clear responsibility on our side/anything concrete we can do.

In this case, I think there's nothing we can do. The mitigation proposed by the paper makes sense. Our users can also leverage kwctl run to keep their configuration tested and look out for these edge cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment from Flavio makes me want to have all entries in this document with "test configuration" be "test configuration with kwctl", just to be explicit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually the To do section does not seem to make sense in the context of this threat. The controller cannot know which is the "right" configuration. Thus, I cannot notify changes in the cluster that it's not taking care. This seems similar to the mitigation of other threats, so it can be a confusion during the document edit.

Comment on lines +82 to +81
I think most of the RBAC is not Kubewarden responsibility. But we can help our users if we:
- Warning them in our docs and *suggest* some minimum RBAC to be used.
- Provide a policy which detect RBAC changes and **maybe** block them.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are two things to take into account for this scenario.

The attacker changes one of our CR

An attacker could delete a ClusterAdmissionPolicy or an AdmissionPolicy object. There's nothing we can do in that case. We might want to highlight how important is to give the RBAC privileges to change these objects only to the users that are trusted.

The attacker changes a WebhookConfiguration object

Other than stating how important is to control via RBAC who can "touch" these Kubernetes objects... there's one question I have: is our controller able to identify the drift between what is described inside of a ClusterAdmissionPolicy/AdmissionPolicy and its related webhook configuration object? If our controller is able to notice that, we would be able to restore the policy.
This is ofc not enough, but would be helpful.

Copy link
Member Author

@jvanz jvanz Apr 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the controller can compare the yaml configuration of the current webhook and the yaml which it generates from the policy definition. If it's different, it can reapply it. Cannot it?

**To do**
Kubewarden should implement mutual TLS authentication
We can add in the recommended policies from the `kubewarden-defaults` Helm
chart a policy to drop the `NET_RAW` capability.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can prevent the creation of containers that have this capability via this policy. Is that one of the policies we enable via our "PSP best practices" selection?


**Mitigation**
All rules are reviewed and tested.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If our "PSP best practices" selection deploys the pod-privileged-policy, this is not going to be an issue. Even if the user provides a mutating policy that by design/mistake makes a container privileged.

Comment on lines 207 to 209
- We may need to define all the API versions cover by the recommended policies
- We can create a configuration to reject by default requests where the API version not cover by the policy.
- We should warning policies developers to cover all the supported API version in theirs tests and reject all of others.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with what rafa wrote

rfc/0006-threat-model.md Show resolved Hide resolved
rfc/0006-threat-model.md Outdated Show resolved Hide resolved
Copy link
Member Author

@jvanz jvanz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kubewarden/kubewarden-developers , JFYI: most of the mitigations came from the official k8s document.


Webhook fails closed. In other words, if the webhook does not respond in time,
for any reasion, API server should reject the request. We are safe on this.
Kubewarden default behavior already does that.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep failing closed. This is a good way to avoid bad actors from deploying malicious apps during a DOS.

timeouts as the admission controller uses compute power to process the workloads

**Mitigation**
Webhook fail closed and authenticate callers. We are safe, Kubewarden already does that.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, users with permissions can change that. This is also mapped in threat #4. But I still believe we should keep failing closed. This is a good practice to avoid bad actors from deploying malicious apps during a DOS. Even if we add some timeout, given enough load, that may not be sufficient. Furthermore, as far as I can remember, users can change this behavior with helm values.

**To do**
Kubewarden should implement mutual TLS authentication
We can add in the recommended policies from the `kubewarden-defaults` Helm
chart a policy to drop the `NET_RAW` capability.
Copy link
Member Author

@jvanz jvanz Apr 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's what I had in mind. Adds the capabilities-psp-policy in the recommended policies skipping the control plane namespaces already listed in the chart.


**Mitigation**
All rules are reviewed and tested.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the "PSP best practices" includes the pod-privileged-policy.

Comment on lines +193 to +191
version not cover by the policy. We should warning policies developers to cover
all the supported API version in theirs tests and reject all of others.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's nice to allow our user to choose if they want to reject requests from unexpected API version or just accept the requests. As @viccuad said, this can be done using policies. Which is nice, because we do not need to change the core of the project and still give an option for our users to protect themselves from this.

I need to check, but if the context aware policies have read access to the Kuberwarden objects, they can read the mapped API version dynamically and rejecting requests of API version that are not listed there.

rfc/0006-threat-model.md Show resolved Hide resolved
rfc/0006-threat-model.md Outdated Show resolved Hide resolved
Comment on lines 207 to 209
- We may need to define all the API versions cover by the recommended policies
- We can create a configuration to reject by default requests where the API version not cover by the policy.
- We should warning policies developers to cover all the supported API version in theirs tests and reject all of others.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, copy and past issue. 🤦‍♂️

@jvanz jvanz force-pushed the rfc-threat-model branch 2 times, most recently from e1f9db2 to aa979ec Compare April 22, 2022 18:42
@jvanz jvanz requested review from ereslibre and flavio April 22, 2022 18:45
@jvanz
Copy link
Member Author

jvanz commented May 17, 2022

@kubewarden/kubewarden-developers, I've reviewed the conversations in this PR and I come up with some actionable items from all the team. Thus, we can discuss the items, change or even drop some of them. They are:

  1. Document the importance of properly RBAC configured privileges.  docs#115

  2. Document the importance of properly RBAC configured privileges. Only the right users should be allowed to manipulate CRD objects. docs#116

  3. Adds a documentation in the policy author section warning about the security threat of the same policy performing mutation and validation. docs#117

  4. Add the capabilities-psp-policy in the recommended policies list from kubewaraden-default chart helm-charts#90

  5. Add the hostpaths-psp-policy in the recommended policies list from kubewaraden-default chart. helm-charts#91

  6. Add timeout to policy evaluation policy-server#254

  7. Research if controller is able to detect and fix changes between the webhook object and its related ClusterAdmissionPolicy/AdmissionPolicy configuration. kubewarden-controller#224

  8. Implement mutual TLS authentication kubewarden-controller#225

  9. Global reject list for ClusterAdmissionPolicy. To allow operators define the namespaces where the policies should ignore. Or document a best practice to add this reject list in the policy definition. kubewarden-controller#226

  10. Research a solution to allow operator rejects requests using not allowed Kubernetes API versions.  kubewarden-controller#227

  11. Provide Software Bill Of Materials kubewarden-controller#228

  12. Sign Helm charts helm-charts#92

I believe we should do items 1 to 5 now. They are easy to do. All the other should be reviewed with more care.

Please, let me know if I forgot something.

@jvanz jvanz requested review from a team, raulcabello and viccuad May 17, 2022 14:47
@jvanz jvanz force-pushed the rfc-threat-model branch from aa979ec to d726248 Compare May 17, 2022 14:51
@viccuad
Copy link
Member

viccuad commented May 17, 2022

That list of action items sounds fine to me. We could create cards already for each of them and discuss them in the planning.

The PR is also complete from my point of view. Nice job!

Copy link
Contributor

@raulcabello raulcabello left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks great! Thanks @jvanz

Copy link
Member

@ereslibre ereslibre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @jvanz.

Although I have some doubts on how to approach some specific issues, I think the split looks great and actionable.

Adds a RFC to discuss the admission control threat model and its
mitigations.
@jvanz jvanz force-pushed the rfc-threat-model branch 2 times, most recently from 7c772a5 to d87391c Compare May 24, 2022 12:18
@jvanz jvanz merged commit 465203e into main May 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Re-check security posture against the k8s admission control threat model
5 participants