Stopping cluster overnight prevents scheduled jobs from running after cluster startup. #42649

Closed
lpreson opened this issue Mar 7, 2017 · 68 comments
Labels: area/workload-api/cronjob, kind/bug, sig/apps

Comments

@lpreson

lpreson commented Mar 7, 2017

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.):
No

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.):
CronJob
ScheduledJob
"Too many missed start times to list"

Is this a BUG REPORT or FEATURE REQUEST? (choose one):
Feature Request

Kubernetes version (use kubectl version):

Environment:

  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): CentOS Linux release 7.3.1611 (Core)
  • Kernel (e.g. uname -a): Linux 4.9.11-1.el7.elrepo.x86_64 #1 SMP Sat Feb 18 18:16:50 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: Custom
  • Others:

What happened:
The cluster is shut down nightly to reduce costs. On startup the cluster refuses to run scheduled jobs.
The controller manager errors with:
E0307 15:40:23.754617 1 controller.go:163] Cannot determine if default/ needs to be started: Too many missed start times to list

What you expected to happen:
The cluster will run scheduled jobs after being restored.

How to reproduce it (as minimally and precisely as possible):
Set a frequent schedule such as */1 * * * *.
Shut the cluster down for 100 scheduled executions of the job (100 minutes).

Anything else we need to know:
This seems to be related to:
Function: getRecentUnmetScheduleTimes
v1.4: https://github.com/kubernetes/kubernetes/blob/release-1.4/pkg/controller/scheduledjob/utils.go#L169

@0xmichalis
Contributor

@kubernetes/sig-apps-bugs

@0xmichalis 0xmichalis added area/workload-api/cronjob sig/apps Categorizes an issue or PR as relevant to SIG Apps. labels Mar 7, 2017
@soltysh
Contributor

soltysh commented Mar 8, 2017

There was a similar problem recently fixed in #36311, although in this particular use case it's hard to distinguish, from the controller's point of view, whether the cluster was down or something else broke. For the time being I can only suggest tweaking .spec.startingDeadlineSeconds.
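
For reference, a minimal sketch of a CronJob with that field set (illustrative only: the name, image, schedule, and apiVersion are placeholders, and the apiVersion in particular varies by cluster version):

```yaml
apiVersion: batch/v1beta1        # placeholder; use whichever CronJob API version your cluster serves
kind: CronJob
metadata:
  name: example-cron             # placeholder name
spec:
  schedule: "*/1 * * * *"
  # Missed runs older than this window are ignored rather than counted
  # toward the controller's missed-start limit.
  startingDeadlineSeconds: 200
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: example
            image: busybox       # placeholder image
            args: ["date"]
```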

@lpreson
Author

lpreson commented Mar 8, 2017

@soltysh if I set .spec.startingDeadlineSeconds to a high number, will it trigger once on resume or trigger all missed jobs?

@soltysh
Contributor

soltysh commented Mar 8, 2017

@lpreson it will trigger missed jobs going back up to .spec.startingDeadlineSeconds, but then again you could try setting .spec.concurrencyPolicy to Forbid; this way only one will be run at a time. I must admit your use case, shutting down the cluster, is far from normal, but it's reasonable to expect the controller to handle it stably.
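
For readers skimming, the two fields mentioned in this thread side by side (a fragment of the spec, not a complete manifest):

```yaml
spec:
  schedule: "*/1 * * * *"
  startingDeadlineSeconds: 200   # only look back 200s for missed runs
  concurrencyPolicy: Forbid      # skip a new run while the previous one is still running
```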

@lpreson
Author

lpreson commented Mar 8, 2017

@soltysh This could be a common cost-saving approach in the cloud when running non-production clusters.

Thank you for the interim suggestion, I will try this out.

@0xmichalis 0xmichalis added the kind/bug Categorizes issue or PR as related to a bug. label May 21, 2017
@0xmichalis
Contributor

Also reported in #45825 - @soltysh @erictune is this something we are targeting to fix in 1.7?

@soltysh
Contributor

soltysh commented May 21, 2017

With the current time-frame I'm not sure. I'll be sweeping through my bugs over the next few weeks and will see what I can do.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 24, 2017
@skynardo

skynardo commented Jan 3, 2018

@lpreson - I was wondering if setting .spec.startingDeadlineSeconds and/or setting .spec.concurrencyPolicy to Forbid resolved this issue. We have a similar issue with two of our non-prod clusters that we shut down each night.

@mattfarina
Contributor

/remove-lifecycle stale

@skynardo Can you share the version of Kubernetes you are still experiencing this on?

@kow3ns did you know about this one?

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 10, 2018
@ApsOps
Contributor

ApsOps commented Feb 16, 2018

I'm still experiencing this on v1.8.6.

@kow3ns kow3ns added this to Backlog in Workloads Feb 27, 2018
@mludvig

mludvig commented Mar 14, 2018

I've got the same problem with Minikube 0.25.0 running Kubernetes server 1.9.0 in a VirtualBox VM on my laptop. When I close the lid and the laptop suspends, the minikube VM obviously stops, and upon resume my scheduled job keeps failing with:

Cannot determine if job needs to be started: Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew.

@soltysh
Contributor

soltysh commented Mar 16, 2018

This is due to the hard-coded value of how many missed start times the cronjob controller can handle. It's reasonable to make it configurable, I guess.
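
To make that limit concrete: with the per-minute schedule from the original report, 100 missed starts accumulate after roughly 100 minutes of downtime. Keeping the look-back window shorter than 100 schedule intervals side-steps the hard-coded limit, for example (fragment only):

```yaml
spec:
  schedule: "*/1 * * * *"
  # 3600s covers at most 60 missed per-minute runs, which stays below the
  # hard-coded limit of 100, so the controller can always recover.
  startingDeadlineSeconds: 3600
```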

@soltysh
Contributor

soltysh commented Mar 20, 2018

Actually, I was thinking about it more this morning. Currently, if we exceed the artificial, hard-coded limit, the cronjob will always error out with no chance of recovery. Maybe we should error out as we do today and pause the cronjob to prevent it from erroring out further, while at the same time giving the user the ability to restart it. I'd like to hear what others think about such an approach.

@mludvig

mludvig commented Mar 20, 2018

A restart button is exactly what I was looking for in the dashboard! I couldn't find a way to do it and had to delete and recreate the job :(
Definitely +1 from me ;)

@skynardo

I would prefer it be configurable. I could then set it greater than the number of scheduled starts the job would have missed overnight while our EC2 instances are shut down. I need something automated; I was thinking about adding a check to my monitoring pod to see if the scheduled job was dead and then recreate it.

@boosty

boosty commented Mar 20, 2018

I would prefer it be configurable

+1

@gtaylor
Contributor

gtaylor commented Apr 3, 2018

This just bit us today, too. We had to work around the issue by deleting and re-adding part of our Helm chart on the cluster :(

Would be great to be able to disable/re-enable a CronJob.

@dzoeteman

dzoeteman commented Apr 10, 2018

Not only would being able to restart the CronJob be nice, but time spent suspended should also be excluded from this check: in that case the user has explicitly said they don't want the CronJob to run, so it should be fine for it to start again on schedule as soon as the user resumes it.

If the above is not a major change - I'd like to pick it up if someone else hasn't yet.

@soltysh
Contributor

soltysh commented Apr 18, 2018

Not only would being able to restart the CronJob be nice, but time spent suspended should also be excluded from this check: in that case the user has explicitly said they don't want the CronJob to run, so it should be fine for it to start again on schedule as soon as the user resumes it.

@dzoeteman suspension is already supported, see .spec.suspend which is false by default.
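
For anyone looking for the field, a fragment showing the toggle (placeholder schedule; in practice this is usually flipped with kubectl edit or kubectl patch):

```yaml
spec:
  schedule: "0 2 * * *"   # placeholder schedule
  suspend: true           # set back to false to resume scheduling
```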

@gtaylor
Contributor

gtaylor commented Apr 18, 2018

If I flip a "stuck" CronJob to spec.suspend = true and then back to false, do I get un-stuck? Or is the answer for sure "delete and re-create the CronJob"?

@Bo0km4n

Bo0km4n commented Jul 28, 2020

The same issue happens on Kubernetes 1.18.6.

@zentale

zentale commented Oct 13, 2020

Another fix for this same thing? #89397

@soltysh
Contributor

soltysh commented Nov 26, 2020

This is being fixed in the new controller #93370

@Skybladev2

Skybladev2 commented Jan 5, 2021

Still experiencing this with minikube v1.16.0 (Kubernetes v1.20.0).

@alculquicondor
Member

@Skybladev2 did you enable the CronJobControllerV2 feature gate?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 26, 2021
@unixfox

unixfox commented Apr 26, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 26, 2021
@soltysh
Contributor

soltysh commented Jul 8, 2021

This should now be resolved with the new controller implementation.
/close

@k8s-ci-robot
Contributor

@soltysh: Closing this issue.

In response to this:

This should now be resolved with the new controller implementation.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Workloads automation moved this from Backlog to Done Jul 8, 2021
eni23 added a commit to adfinis/openshift-etcd-backup that referenced this issue Aug 11, 2021
For every CronJob, the CronJob controller checks how many schedules it missed in the duration from its last scheduled time until now. If there are more than 100 missed schedules, it does not start the job and logs the error. If .spec.startingDeadlineSeconds is set, the cronjob will survive a cluster downtime.

See: https://access.redhat.com/solutions/3667021 and kubernetes/kubernetes#42649 for details
eni23 added a commit to adfinis/helm-charts that referenced this issue Aug 11, 2021
For every CronJob, the CronJob controller checks how many schedules it missed in the duration from its last scheduled time until now. If there are more than 100 missed schedules, it does not start the job and logs the error. If .spec.startingDeadlineSeconds is set, the cronjob will survive a cluster downtime.

See: https://access.redhat.com/solutions/3667021 and kubernetes/kubernetes#42649 for details
eni23 added a commit to adfinis/helm-charts that referenced this issue Aug 11, 2021
* Fix for openshift-etcd-backup cronjob

For every CronJob, the CronJob controller checks how many schedules it missed in the duration from its last scheduled time until now. If there are more than 100 missed schedules, it does not start the job and logs the error. If .spec.startingDeadlineSeconds is set, the cronjob will survive a cluster downtime.

See: https://access.redhat.com/solutions/3667021 and kubernetes/kubernetes#42649 for details

* update chart version
eni23 added a commit to adfinis/openshift-etcd-backup that referenced this issue Sep 10, 2021
For every CronJob, the CronJob controller checks how many schedules it missed in the duration from its last scheduled time until now. If there are more than 100 missed schedules, it does not start the job and logs the error. If .spec.startingDeadlineSeconds is set, the cronjob will survive a cluster downtime.

See: https://access.redhat.com/solutions/3667021 and kubernetes/kubernetes#42649 for details
HeroCC added a commit to HeroCC/infra that referenced this issue Jan 24, 2022
If the cluster is down for an extended period of time, the cronjob will fail until recreated. The behavior is fixed in CronV2, but we don't use it yet. For now, use this fix.

See kubernetes/kubernetes#42649
@hughesadam87

hughesadam87 commented Feb 23, 2022

FWIW - another use case besides just shutting the cluster down is pausing the job. In GCP it's easy to pause a cron job, and we often do this. If the job is having an issue, I'll pause the prod container for hours while it's being debugged and then run into this issue when trying to resume. We are using an old GKE/K8s version, so I'm hoping it goes away when we upgrade.


The "Run Now" still works for one-off runs, but then the cron does not resume after. I'm going to try the restartPolicy=forbid as mentioned above

Shuanglu pushed a commit to Shuanglu/istio-tools that referenced this issue Jun 30, 2022
* Fix CronJob deadlock

Apparently if we fail to start 100 jobs we just stop forever.. this left
my cluster in a state with no endpoint updates. See
kubernetes/kubernetes#42649

* fix type
Shuanglu pushed a commit to Shuanglu/istio-tools that referenced this issue Jul 6, 2022
* Fix CronJob deadlock

Apparently if we fail to start 100 jobs we just stop forever.. this left
my cluster in a state with no endpoint updates. See
kubernetes/kubernetes#42649

* fix type