Stopping cluster overnight prevents scheduled jobs from running after cluster startup. #42649

Closed
lpreson opened this issue Mar 7, 2017 · 68 comments
Labels: area/workload-api/cronjob, kind/bug, sig/apps

Comments

@lpreson

lpreson commented Mar 7, 2017

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.):
No

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.):
CronJob
ScheduledJob
"Too many missed start times to list"

Is this a BUG REPORT or FEATURE REQUEST? (choose one):
Feature Request

Kubernetes version (use kubectl version):

Environment:

  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): CentOS Linux release 7.3.1611 (Core)
  • Kernel (e.g. uname -a): Linux 4.9.11-1.el7.elrepo.x86_64 #1 SMP Sat Feb 18 18:16:50 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: Custom
  • Others:

What happened:
The cluster is shut down nightly to reduce costs. On startup the cluster refuses to run scheduled jobs.
The controller manager errors with:
E0307 15:40:23.754617 1 controller.go:163] Cannot determine if default/ needs to be started: Too many missed start times to list

What you expected to happen:
The cluster will run scheduled jobs after being restored.

How to reproduce it (as minimally and precisely as possible):
Set a frequent schedule such as */1 * * * *.
Shut the cluster down for 100 scheduled executions of the job (100 minutes).

Anything else we need to know:
This seems to be related to:
Function: getRecentUnmetScheduleTimes
v1.4: https://github.com/kubernetes/kubernetes/blob/release-1.4/pkg/controller/scheduledjob/utils.go#L169

@0xmichalis
Contributor

@kubernetes/sig-apps-bugs

@0xmichalis 0xmichalis added area/workload-api/cronjob sig/apps Categorizes an issue or PR as relevant to SIG Apps. labels Mar 7, 2017
@soltysh
Contributor

soltysh commented Mar 8, 2017

There was a similar problem recently fixed in #36311, although in this particular use case it's hard to distinguish, from the controller's point of view, whether the cluster was down or something else broke. For the time being I can only suggest tweaking .spec.startingDeadlineSeconds.
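
For reference, a minimal sketch of a CronJob with that field set (illustrative only: the name, image, schedule, and apiVersion are placeholders, and the apiVersion in particular varies by cluster version):

```yaml
apiVersion: batch/v1beta1        # placeholder; use whichever CronJob API version your cluster serves
kind: CronJob
metadata:
  name: example-cron             # placeholder name
spec:
  schedule: "*/1 * * * *"
  # Missed runs older than this window are ignored rather than counted
  # toward the controller's missed-start limit.
  startingDeadlineSeconds: 200
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: example
            image: busybox       # placeholder image
            args: ["date"]
```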

@lpreson
Author

lpreson commented Mar 8, 2017

@soltysh if I set .spec.startingDeadlineSeconds to a high number, will it trigger once on resume or trigger all missed jobs?

@soltysh
Contributor

soltysh commented Mar 8, 2017

@lpreson it will trigger missed jobs going back up to .spec.startingDeadlineSeconds, but then again you could try setting .spec.concurrencyPolicy to Forbid; this way only one will be run at a time. I must admit your use case, shutting down the cluster, is far from normal, but it's reasonable to expect the controller to handle it stably.
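
For readers skimming, the two fields mentioned in this thread side by side (a fragment of the spec, not a complete manifest):

```yaml
spec:
  schedule: "*/1 * * * *"
  startingDeadlineSeconds: 200   # only look back 200s for missed runs
  concurrencyPolicy: Forbid      # skip a new run while the previous one is still running
```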

@lpreson
Author

lpreson commented Mar 8, 2017

@soltysh This could be a common cost-saving approach in the cloud when running non-production clusters.

Thank you for the interim suggestion, I will try this out.

@0xmichalis 0xmichalis added the kind/bug Categorizes issue or PR as related to a bug. label May 21, 2017
@0xmichalis
Contributor

Also reported in #45825 - @soltysh @erictune is this something we are targeting to fix in 1.7?

@soltysh
Contributor

soltysh commented May 21, 2017

With the current time-frame I'm not sure. I'll be sweeping through my bugs over the next few weeks and will see what I can do.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 24, 2017
@skynardo

skynardo commented Jan 3, 2018

@lpreson - I was wondering if setting .spec.startingDeadlineSeconds and/or setting .spec.concurrencyPolicy to Forbid resolved this issue. We have a similar issue with two of our non-prod clusters that we shut down each night.

@mattfarina
Contributor

/remove-lifecycle stale

@skynardo Can you share the version of Kubernetes you are still experiencing this on?

@kow3ns did you know about this one?

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 10, 2018
@ApsOps
Contributor

ApsOps commented Feb 16, 2018

I'm still experiencing this on v1.8.6.

@kow3ns kow3ns added this to Backlog in Workloads Feb 27, 2018
@mludvig

mludvig commented Mar 14, 2018

I've got the same problem with Minikube 0.25.0 running Kubernetes server 1.9.0 in a VirtualBox VM on my laptop. When I close the lid and the laptop suspends, the minikube VM obviously stops, and upon resume my scheduled job keeps failing with:

Cannot determine if job needs to be started: Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew.

@soltysh
Contributor

soltysh commented Mar 16, 2018

This is due to the hard-coded value of how many missed start times the cronjob controller can handle. It's reasonable to make it configurable, I guess.
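
To make that limit concrete: with the per-minute schedule from the original report, 100 missed starts accumulate after roughly 100 minutes of downtime. Keeping the look-back window shorter than 100 schedule intervals side-steps the hard-coded limit, for example (fragment only):

```yaml
spec:
  schedule: "*/1 * * * *"
  # 3600s covers at most 60 missed per-minute runs, which stays below the
  # hard-coded limit of 100, so the controller can always recover.
  startingDeadlineSeconds: 3600
```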

@soltysh
Contributor

soltysh commented Mar 20, 2018

Actually, I was thinking about it more this morning. Currently, if we exceed the artificial, hard-coded limit, the cronjob will always error out with no chance of recovery. Maybe we should error out as we do today and pause the cronjob to prevent it from erroring out further, while at the same time giving the user the ability to restart it. I'd like to hear what others think about such an approach.

@mludvig

mludvig commented Mar 20, 2018

A restart button is exactly what I was looking for in the dashboard! I couldn't find a way to do it and had to delete and recreate the job :(
Definitely +1 from me ;)

@skynardo

I would prefer it be configurable. I could then set it greater than the number of scheduled starts the job would have missed overnight while our EC2 instances are shut down. I need something automated; I was thinking about adding a check to my monitoring pod to see if the scheduled job was dead and then recreate it.

@boosty

boosty commented Mar 20, 2018

I would prefer it be configurable

+1

@gtaylor
Contributor

gtaylor commented Apr 3, 2018

This just bit us today, too. We had to work around the issue by deleting and re-adding part of our Helm chart on the cluster :(

Would be great to be able to disable/re-enable a CronJob.

@dzoeteman

dzoeteman commented Apr 10, 2018

Not only would being able to restart the CronJob be nice, but time spent suspended should also be excluded from this check: in that case the user has explicitly said they don't want the CronJob to run, so it should be fine for it to start again on schedule as soon as the user resumes it.

If the above is not a major change - I'd like to pick it up if someone else hasn't yet.

@soltysh
Contributor

soltysh commented Apr 18, 2018

Not only would being able to restart the CronJob be nice, but time spent suspended should also be excluded from this check: in that case the user has explicitly said they don't want the CronJob to run, so it should be fine for it to start again on schedule as soon as the user resumes it.

@dzoeteman suspension is already supported, see .spec.suspend which is false by default.
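
For anyone looking for the field, a fragment showing the toggle (placeholder schedule; in practice this is usually flipped with kubectl edit or kubectl patch):

```yaml
spec:
  schedule: "0 2 * * *"   # placeholder schedule
  suspend: true           # set back to false to resume scheduling
```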

@gtaylor
Contributor

gtaylor commented Apr 18, 2018

If I flip a "stuck" CronJob to spec.suspend = true and then back to false, do I get un-stuck? Or is the answer for sure "delete and re-create the CronJob"?

@Bo0km4n

Bo0km4n commented Jul 28, 2020

The same issue happens on Kubernetes 1.18.6.

@zentale

zentale commented Oct 13, 2020

Another fix for this same thing? #89397

@soltysh
Contributor

soltysh commented Nov 26, 2020

This is being fixed in the new controller #93370

@Skybladev2

Skybladev2 commented Jan 5, 2021

Still experiencing this with minikube v1.16.0 (Kubernetes v1.20.0).

@alculquicondor
Member

@Skybladev2 did you enable the CronJobControllerV2 feature gate?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 26, 2021
@unixfox

unixfox commented Apr 26, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 26, 2021
@soltysh
Contributor

soltysh commented Jul 8, 2021

This should now be resolved with the new controller implementation.
/close

@k8s-ci-robot
Contributor

@soltysh: Closing this issue.

In response to this:

This should now be resolved with the new controller implementation.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Workloads automation moved this from Backlog to Done Jul 8, 2021
eni23 added a commit to adfinis/openshift-etcd-backup that referenced this issue Aug 11, 2021
For every CronJob, the CronJob controller checks how many schedules it missed in the duration from its last scheduled time until now. If there are more than 100 missed schedules, it does not start the job and logs the error. If .spec.startingDeadlineSeconds is set, the cronjob will survive a cluster downtime.

See: https://access.redhat.com/solutions/3667021 and kubernetes/kubernetes#42649 for details
eni23 added a commit to adfinis/helm-charts that referenced this issue Aug 11, 2021
For every CronJob, the CronJob controller checks how many schedules it missed in the duration from its last scheduled time until now. If there are more than 100 missed schedules, it does not start the job and logs the error. If .spec.startingDeadlineSeconds is set, the cronjob will survive a cluster downtime.

See: https://access.redhat.com/solutions/3667021 and kubernetes/kubernetes#42649 for details
eni23 added a commit to adfinis/helm-charts that referenced this issue Aug 11, 2021
* Fix for openshift-etcd-backup cronjob

For every CronJob, the CronJob controller checks how many schedules it missed in the duration from its last scheduled time until now. If there are more than 100 missed schedules, it does not start the job and logs the error. If .spec.startingDeadlineSeconds is set, the cronjob will survive a cluster downtime.

See: https://access.redhat.com/solutions/3667021 and kubernetes/kubernetes#42649 for details

* update chart version
eni23 added a commit to adfinis/openshift-etcd-backup that referenced this issue Sep 10, 2021
For every CronJob, the CronJob controller checks how many schedules it missed in the duration from its last scheduled time until now. If there are more than 100 missed schedules, it does not start the job and logs the error. If .spec.startingDeadlineSeconds is set, the cronjob will survive a cluster downtime.

See: https://access.redhat.com/solutions/3667021 and kubernetes/kubernetes#42649 for details
HeroCC added a commit to HeroCC/infra that referenced this issue Jan 24, 2022
If the cluster is down for an extended period of time, the cronjob will fail until recreated. The behavior is fixed in CronV2, but we don't use it yet. For now, use this fix.

See kubernetes/kubernetes#42649
@hughesadam87

hughesadam87 commented Feb 23, 2022

FWIW - another use case besides just shutting the cluster down is pausing the job. In GCP it's easy to pause a cron job, and we often do this. If the job is having an issue, I'll pause the prod container for hours while it's being debugged and then run into this issue when trying to resume. We are using an old GKE/K8s version, so I'm hoping it goes away when we upgrade.


The "Run Now" still works for one-off runs, but then the cron does not resume after. I'm going to try the restartPolicy=forbid as mentioned above

Shuanglu pushed a commit to Shuanglu/istio-tools that referenced this issue Jun 30, 2022
* Fix CronJob deadlock

Apparently if we fail to start 100 jobs we just stop forever.. this left
my cluster in a state with no endpoint updates. See
kubernetes/kubernetes#42649

* fix type
Shuanglu pushed a commit to Shuanglu/istio-tools that referenced this issue Jul 6, 2022
* Fix CronJob deadlock

Apparently if we fail to start 100 jobs we just stop forever.. this left
my cluster in a state with no endpoint updates. See
kubernetes/kubernetes#42649

* fix type