Default node pool is suffering from memory pressure
As seen in incident production#2381 (closed), one of the troubleshooting steps showed that our default node pool occasionally suffers from memory pressure. This suggests that we have not properly sized our workloads to fit on the existing cluster.
There are at least three items we need to work on here:
- We have no alerting or dashboards that show us when a Pod is evicted. Evicted Pods accumulate over time, and we do not currently clean them up; that leftover state is what allowed us to determine why Pods were being evicted, captured in this comment: production#2381 (comment 375248360). See the alert-rule sketch after this list.
- We need to evaluate the memory usage of this workload. Consider re-evaluating our memory requests: requests that reflect actual usage should enable GKE to scale up the available nodes, relieving memory pressure as workloads are spread across more of them. See the requests sketch after this list.
- Consider evaluating the node type. We've recently added additional workloads to the cluster, and we may not be distributing load across it effectively. This will be harder to accomplish because we lock some workloads to specific node pools (see the nodeSelector fragment after this list). Consider this an optional item if the two above are insufficient to resolve the problem.
- Reconsider pod memory limits if the node pool is changed; specifically, consider whether we should revert gitlab-com/gl-infra/k8s-workloads/gitlab-com!299 (comment 379975749).
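
For the alerting item, a rule along these lines could surface evictions as they happen. This is only a sketch: the rule name, namespace, `for` window, and severity are assumptions, and it presumes kube-state-metrics is running and exposing `kube_pod_status_reason`.

```yaml
# Hypothetical PrometheusRule; all names and thresholds are illustrative.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-eviction-alerts
  namespace: monitoring
spec:
  groups:
    - name: pod-evictions
      rules:
        - alert: PodsEvicted
          # kube-state-metrics sets this series to 1 for each Pod whose
          # status reason is "Evicted"; fire when any such Pods exist.
          expr: sum(kube_pod_status_reason{reason="Evicted"}) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Pods are being evicted
            description: "{{ $value }} Pod(s) currently report an Evicted status."
```

The same expression, broken out by namespace, would also make a reasonable dashboard panel.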
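
For the memory-request item, the change would live in the workload spec (or the corresponding Helm values). The figures below are placeholders rather than measured values; the point is that requests sized to observed usage let the GKE autoscaler add nodes before pressure builds, whereas undersized requests over-pack Pods onto each node.

```yaml
# Illustrative Deployment fragment; name, image, and sizes are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-workload
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-workload
  template:
    metadata:
      labels:
        app: example-workload
    spec:
      containers:
        - name: app
          image: example/app:latest
          resources:
            requests:
              # Sized from observed usage so the scheduler reserves
              # realistic headroom per Pod and GKE scales nodes up.
              memory: 2Gi
              cpu: 500m
            limits:
              memory: 3Gi
```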
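
For context on the node-type item, workloads are typically locked to a pool with a nodeSelector (and sometimes taints and tolerations), which is what constrains how freely the scheduler can redistribute them. A minimal Pod spec fragment, with illustrative pool and taint names:

```yaml
# Hypothetical fragment of a Pod template; label value and taint are examples.
spec:
  nodeSelector:
    # GKE labels every node with the name of its pool.
    cloud.google.com/gke-nodepool: default-pool
  tolerations:
    - key: dedicated
      operator: Equal
      value: example-workload
      effect: NoSchedule
```

Changing machine types means revisiting each of these pins, which is part of why this item is optional.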