Kubelet/Kubernetes should work with Swap Enabled #53533

Closed
outcoldman opened this issue Oct 6, 2017 · 122 comments · Fixed by #102823
Labels

  • kind/feature: Categorizes issue or PR as related to a new feature.
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@outcoldman

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind bug

/kind feature

What happened:

Kubelet/Kubernetes 1.8 does not work with Swap enabled on Linux Machines.

I found the original issue #31676, the PR #31996, and the last change, which enabled the check by default: 71e8c8e

If Kubernetes does not know how to handle memory eviction when swap is enabled, it should find a way to do so rather than asking users to get rid of swap.

See kernel.org's Chapter 11, Swap Management, for example:

The casual reader may think that with a sufficient amount of memory, swap is unnecessary but this brings us to the second reason. A significant number of the pages referenced by a process early in its life may only be used for initialisation and then never used again. It is better to swap out those pages and create more disk buffers than leave them resident and unused.

When running a lot of node/java applications, I have always seen many pages swapped out simply because they aren't used anymore.

What you expected to happen:

Kubelet/Kubernetes should work with swap enabled. I believe that instead of disabling swap and giving users no choice, Kubernetes should support more use cases and a wider range of workloads, some of which may be applications that rely on caches.

I am not sure how Kubernetes decides what to kill during memory eviction, but considering that Linux has this capability, maybe it should align with how Linux does it? https://www.kernel.org/doc/gorman/html/understand/understand016.html

I would suggest rolling back the change that makes the kubelet fail when swap is enabled, and revisiting how memory eviction currently works in Kubernetes. Swap can be important for some workloads.

How to reproduce it (as minimally and precisely as possible):

Run kubernetes/kubelet with default settings on a Linux box with swap enabled

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

/sig node
cc @mtaufen @vishh @derekwaynecarr @dims

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Oct 6, 2017
@derekwaynecarr
Member

Support for swap is non-trivial. Guaranteed pods should never require swap. Burstable pods should have their requests met without requiring swap. BestEffort pods have no guarantee. The kubelet right now lacks the smarts to provide the right amount of predictable behavior here across pods.

We discussed this topic at the resource mgmt face to face earlier this year. We are not super interested in tackling this in the near term relative to the gains it could realize. We would prefer to improve reliability around pressure detection, and optimize issues around latency before trying to optimize for swap, but if this is a higher priority for you, we would love your help.
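
The three QoS classes mentioned above follow from how a pod sets its requests and limits. A minimal sketch of that rule for a single-container pod, where `qos_class` is a hypothetical helper (not part of kubectl or the kubelet) and quantities are compared as plain strings, so `500m` and `0.5` would wrongly count as different:

```shell
# qos_class REQ_CPU LIM_CPU REQ_MEM LIM_MEM
# Classify a single-container pod by its CPU/memory requests and limits.
# An empty string means "not set". Hypothetical helper for illustration only.
qos_class() {
    req_cpu=$1 lim_cpu=$2 req_mem=$3 lim_mem=$4

    # BestEffort: no requests and no limits at all.
    if [ -z "$req_cpu$lim_cpu$req_mem$lim_mem" ]; then
        echo BestEffort
        return
    fi

    # When only a limit is given, the request defaults to the limit.
    [ -z "$req_cpu" ] && req_cpu=$lim_cpu
    [ -z "$req_mem" ] && req_mem=$lim_mem

    # Guaranteed: both resources have limits, and requests equal limits.
    if [ -n "$lim_cpu" ] && [ -n "$lim_mem" ] &&
       [ "$req_cpu" = "$lim_cpu" ] && [ "$req_mem" = "$lim_mem" ]; then
        echo Guaranteed
        return
    fi

    # Everything else sits in between.
    echo Burstable
}
```

For example, `qos_class 500m 500m 1Gi 1Gi` prints `Guaranteed`, while `qos_class 250m 500m 1Gi 1Gi` prints `Burstable`.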

@derekwaynecarr
Member

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 7, 2017
@liggitt liggitt removed the kind/bug Categorizes issue or PR as related to a bug. label Oct 7, 2017
@outcoldman
Author

@derekwaynecarr thank you for the explanation! It was hard to find any information or documentation on why swap should be disabled for Kubernetes. That was the main reason I opened this issue. At this point this is not a high priority for me; I just wanted to be sure that we have a place where it can be discussed.

@matthiasr

There is more context in the discussion here: #7294 – having swap available has very strange and bad interactions with memory limits. For example, a container that hits its memory limit would then start spilling over into swap (this appears to be fixed since f4edaf2 – they won't be allowed to use any swap whether it's there or not).

@fieryorc

fieryorc commented Jan 2, 2018

This is a critical use case for us too. We have a cron job that occasionally runs into high memory usage (>30GB) and we don't want to permanently allocate 40+GB nodes. Also, given that we run in three zones (GKE), this will allocate 3 such machines (1 in each zone). And this configuration has to be repeated in 3+ production instances and 10+ test instances, making K8s super expensive to use. We are forced to have 25+ 48GB nodes, which incurs a huge cost!
Please enable swap!

@hjwp

hjwp commented Jan 5, 2018

A workaround for those who really want swap. If you

  • start kubelet with --fail-swap-on=false
  • add swap to your nodes

then containers which do not specify a memory requirement will by default be able to use all of the machine memory, including swap.

That's what we're doing. Or at least, I'm pretty sure it is, I didn't actually implement it personally, but that's what I gather.

This might only really be a viable strategy if none of your containers ever specify an explicit memory requirement...
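
Before relying on a workaround like this, it can help to confirm whether a node actually has swap active. A minimal sketch that reads /proc/swaps, the kernel's list of active swap areas; `swap_active` is a hypothetical helper name, and the optional file argument exists only so the function can be tried against a copy of the file:

```shell
# swap_active [FILE]
# Succeeds (exit 0) if FILE, default /proc/swaps, lists at least one swap area.
# Hypothetical helper for illustration only.
swap_active() {
    swaps_file=${1:-/proc/swaps}
    # The first line is the column header; any further non-empty line
    # describes an active swap device or file.
    awk 'NR > 1 && NF > 0 { found = 1 } END { exit !found }' "$swaps_file"
}
```

On a node, `if swap_active; then echo "swap is on"; else echo "swap is off"; fi` reports the current state.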

@fieryorc

fieryorc commented Jan 6, 2018

We run in GKE, and I don't know of a way to set those options.

@vishh
Contributor

vishh commented Jan 25, 2018

I'd be open to considering adopting zswap if someone can evaluate the implications to memory evictions in kubelet.

@icewheel

icewheel commented Jan 30, 2018

I am running Kubernetes on my local Ubuntu laptop, and after each restart I have to turn off swap. I also have to be careful not to get near the memory limit, since swap is off.

Is there any way to avoid turning off swap after each restart, such as a configuration file change in the existing installation?

I don't need swap on nodes running in a cluster.

It's just the other applications on my laptop, apart from the local Kubernetes dev cluster, that need swap to be turned on.

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.2", GitCommit:"5fa2db2bd46ac79e5e00a4e6ed24191080aa463b", GitTreeState:"clean", BuildDate:"2018-01-18T10:09:24Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.2", GitCommit:"5fa2db2bd46ac79e5e00a4e6ed24191080aa463b", GitTreeState:"clean", BuildDate:"2018-01-18T09:42:01Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}

Right now the flag is not working.

# systemctl restart kubelet --fail-swap-on=false
systemctl: unrecognized option '--fail-swap-on=false'
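
The error above comes from systemctl itself: it does not forward unknown options to the unit it restarts, so --fail-swap-on=false has to become part of the kubelet's own command line. A sketch for one common setup, assuming a kubeadm-packaged kubelet whose systemd unit reads KUBELET_EXTRA_ARGS from /etc/default/kubelet (Debian/Ubuntu) or /etc/sysconfig/kubelet (RPM-based distributions); other installers wire the kubelet differently, and `set_kubelet_extra_args` is a hypothetical helper name:

```shell
# set_kubelet_extra_args FILE
# Ensure FILE sets KUBELET_EXTRA_ARGS=--fail-swap-on=false.
# Assumes the kubelet unit loads FILE as an environment file (kubeadm packages).
set_kubelet_extra_args() {
    env_file=$1
    line='KUBELET_EXTRA_ARGS=--fail-swap-on=false'

    if [ -f "$env_file" ]; then
        # Already configured: nothing to do.
        if grep -qxF -e "$line" "$env_file"; then
            return 0
        fi
        # A different value is present: do not clobber it, merge by hand instead.
        if grep -q '^KUBELET_EXTRA_ARGS=' "$env_file"; then
            echo "KUBELET_EXTRA_ARGS already set in $env_file; edit it manually" >&2
            return 1
        fi
    fi

    printf '%s\n' "$line" >> "$env_file"
}

# Usage (as root), followed by a plain restart with no extra options:
#   set_kubelet_extra_args /etc/default/kubelet
#   systemctl restart kubelet
```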

@mtaufen
Contributor

mtaufen commented Feb 2, 2018 via email

@icewheel

icewheel commented Feb 2, 2018

thanks @mtaufen

@dbogatov

For systems that bootstrap the cluster for you (like Terraform), you may need to modify the service file.

This worked for me:

sudo sed -i '/kubelet-wrapper/a \ --fail-swap-on=false \\\' /etc/systemd/system/kubelet.service

@srevenant

srevenant commented Apr 3, 2018

Not supporting swap as a default? I was surprised to hear this -- I thought Kubernetes was ready for the prime time? Swap is one of those features.

This is not really optional in most open use cases -- it is how the Unix ecosystem is designed to run, with the VMM switching out inactive pages.

If the choice is no swap or no memory limits, I'll choose to keep swap any day, and just spin up more hosts when I start paging, and I will still come out saving money.

Can somebody clarify -- is the problem with memory eviction only a problem if you are using memory limits in the pod definition, but otherwise, it is okay?

It'd be nice to work in a world where I have control over the way an application memory works so I don't have to worry about poor memory usage, but most applications have plenty of inactive memory space.

I honestly think this recent move to run servers without swap is driven by the PaaS providers trying to coerce people into larger memory instances--while disregarding ~40 years of memory management design. The reality is that the kernel is really good about knowing what memory pages are active or not--let it do its job.

@chrissound

This also has an effect that if the memory gets exhausted on the node, it will potentially become completely locked up - requiring a restart of the node, rather than just slowing down and recovering a while later.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 30, 2018
@ehashman
Member

/assign

@abdennour

My way in an offline environment: swapoff -a && systemctl restart kubelet

@agowa

agowa commented Feb 26, 2021

@abdennour That doesn't solve the issue. You're just disabling swap. Depending on your workload, that may or may not be viable, as has already been pointed out in this issue.

@t3hmrman

t3hmrman commented Mar 29, 2021

Is there any actual downside to leaving swap on and setting --fail-swap-on=false?

I understand the decision/recommendation and can understand it will take some time to work through reconsidering it, but regardless of how that goes, am I correct in thinking that the only actual immediate downside is over-committed memory and the resulting degraded memory performance for some workloads on the margin?

The scheduler does not take swap into account when determining node resources, right? And the OOMKiller will still come around and kill processes that escape their limits. Theoretically, a node with no Burstable/BestEffort-classified workloads (and little or no additional usage from external sources) would still function well, especially when you're not at the margins, right?

@ehashman
Member

ehashman commented Apr 6, 2021

Swap KEP draft up at kubernetes/enhancements#2602

Feature tracked at kubernetes/enhancements#2400

Aiming for an alpha MVP for 1.22 release (the upcoming one). PTAL!

@ehashman
Member

ehashman commented Apr 6, 2021

(And sorry, I realized I assigned this and left everyone hanging - I've sent out some emails to the mailing list and we have been iterating on a design doc I used to develop the draft KEP above.)

@deavid

deavid commented Apr 22, 2021

I'm new to Kubernetes and just learned about this issue; I think swap should get at least very minimal support.
I mostly understand why swap is a problem to implement and why it is seen as something that doesn't add much value: Linux doesn't provide much tooling to properly control swap (although it is possible), and as Kubernetes is expected to run many pods, one would run it on a big machine with lots of memory. In this scenario, enabling swap will bring almost no benefit.

But Kubernetes should aim for broader support of other scenarios. For example, I'm planning to have 3 very small nodes to run 1 pod each, and use Kubernetes mainly for replica and fail-over. Nothing fancy, just one big app on three VPS.

When the amount of memory of the host is small, having swap is critical for the stability of the host system. A Linux distribution does not run on constant or pre-allocated memory, therefore there is always a chance that something in the host OS could produce a surge in memory and without swap the oom killer would be invoked. And in my experience, when the linux OOM comes in the results are nothing good, and configuring it properly requires extended knowledge on how your particular OS installation behaves, what's critical and what's not.

Following this train of thought, my problem is more about Kubernetes requiring the sysadmin to disable swap entirely on the node than about having proper swap support for pods. Showing a nasty warning instead of failing to start seems a better option to me than requiring a flag to be set.

Having proper swap support for pods also sounds really interesting, as it can make the nodes very dense, which can be useful for certain applications. (Some apps preallocate a lot of memory, touch it, and almost never go back to it.) And we're also seeing faster drives lately, PCIe 4.0 and new standards for using drives as memory; with these, moving data back from disk to memory is fast enough to consider swapping as an option for packing more onto each server.

My point here is basically: 1) I believe swap support is needed. 2) kubernetes doesn't need to get from 0 to 100 in one shot, there are lots of middle options that are also reasonably valid that would mitigate the majority of issues people have with removing swap entirely.

@ehashman
Member

Since we have a lot of folks commenting on this issue who are new to the Kubernetes development process, I'll try to explain what I linked above a bit more clearly.

  • I've assigned myself to this issue. This means I am working on the implementation for MVP swap support.
  • This is a very large feature, so we can't track its implementation just with a GitHub issue. It must go through the Kubernetes enhancements process aka a KEP to ensure that all of the API changes, etc. get properly reviewed.
  • It will take a minimum of 3 release cycles to "graduate". I am targeting the 1.22 release [August 2021] for alpha support. Alpha means you must enable the feature flag to use it. I'm targeting 1.23 [December 2021] for beta support, where the feature flag would be on by default. In either case, an end user would still need to provision swap and explicitly enable swap support in the kubelet.
  • The design proposal I'm working on is here: https://github.com/ehashman/k-enhancements/blob/kep-2400/keps/sig-node/2400-node-swap/README.md#summary
  • The design proposal has not yet been accepted, but the deadline for this upcoming release cycle is May 13th, so we will know soon, and I'll post an update here when it is marked implementable.

At this point I don't think there's opposition to implementing some kind of swap support, it's just a matter of doing so carefully and in a way that will address most use cases without adding too much complexity or breaking existing users.

@ehashman
Member

My proposal has been accepted for the 1.22 cycle.

We will proceed with the implementation described in the design doc: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2400-node-swap

@adisky
Contributor

adisky commented Jun 25, 2021

/triage accepted

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jun 25, 2021
@superdave

Spectacular work, @ehashman! Thank you so much for driving this through.

@ehashman
Member

ehashman commented Jul 7, 2021

Greetings, friends!

This issue has been closed because my PR has merged, and we now have alpha swap support available in Kubernetes. It should be available in the next 1.22 branch release cut (v1.22.0-beta.1).

There are a few things to keep in mind:

  • The CRI has changed, so container runtimes (e.g. containerd, cri-o) won't be able to actually accept swap configs until they have been updated against the new CRI. That work is in flight. I'll try to include the minimum supported versions in documentation.
  • The work isn't over with this release; there will be a multi-release graduation process. I don't anticipate this graduating until at least 1.25. I am targeting beta for 1.23, at which point the feature flag will be on by default, but it's possible that will slip to 1.24.
  • I am working on a feature blog for the 1.22 release highlights, as well as documentation for how to turn this on in a cluster. The steps will vary depending on how you deploy Kubernetes.
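
As a rough sketch of what turning the alpha feature on might look like for a single node, assuming the kubelet is started with --config pointing at a KubeletConfiguration file. The field names follow the KEP-2400 alpha design for 1.22 (the NodeSwap feature gate, failSwapOn, and memorySwap.swapBehavior); the helper name `write_swap_kubelet_config` and the output path are hypothetical, and the documentation for your Kubernetes version takes precedence:

```shell
# write_swap_kubelet_config FILE
# Write a minimal KubeletConfiguration enabling alpha node swap support.
# Hypothetical helper; in practice these fields would be merged into the
# node's existing kubelet configuration rather than replacing it.
write_swap_kubelet_config() {
    out=$1
    cat > "$out" <<'EOF'
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Do not refuse to start when the node has swap enabled.
failSwapOn: false
# Alpha in 1.22: the feature gate must be enabled explicitly.
featureGates:
  NodeSwap: true
# One of the two alpha behaviours; the other is UnlimitedSwap.
memorySwap:
  swapBehavior: LimitedSwap
EOF
}
```

Swap still has to be provisioned on the node separately, and the container runtime needs to support the updated CRI, as noted above.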

@BenTheElder
Member

> The CRI has changed, so container runtimes (e.g. containerd, cri-o) won't be able to actually accept swap configs until they have been updated against the new CRI. That work is in flight. I'll try to include the minimum supported versions in documentation.

Is kubernetes/enhancements#2400 the best place to keep an eye out for that work?

@ehashman
Member

ehashman commented Jul 8, 2021

Yup, that's right. Future work will be tracked on that issue. I think our beta criteria are mostly solid, so it's a matter of whether we'll be able to get all the work done for beta next release, as there is a lot to do. Then there will be some lag time between beta and GA as we gather feedback and make updates.

Help definitely wanted! If anyone following this issue wants to jump in, you can reach out to me at ehashman at redhat dot com or on k8s Slack (@ehashman).

@Alceatraz

Just a hypothetical:

Swap on the host could prevent crashes when the OS itself runs out of memory, while Kubernetes/the CRI, as software, would not allow container processes to use swap. Would that be difficult to do, or cause any problems?

@misko

misko commented Mar 27, 2022

I have been trying to find workarounds for swap, and eventually I just wrote a user-space swap solution using mmap. I've been using it for a week now and it seems to work pretty well: https://github.com/misko/bigmaac.git

Not sure if this helps anyone here. To swap a process, use:

LD_PRELOAD=./bigmaac.so your-exec with args

If you only want to swap memory allocations larger than X bytes, use:

LD_PRELOAD=./bigmaac.so BIGMAAC_MIN_SIZE=X your-exec with args
