
KEP for Graduating CronJob to GA #978

Closed
barney-s wants to merge 17 commits

Conversation

barney-s
Contributor

  • KEP for graduating CronJob to GA

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 20, 2019
@k8s-ci-robot
Contributor

Hi @barney-s. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/apps Categorizes an issue or PR as relevant to SIG Apps. labels Apr 20, 2019
@justaugustus
Member

/ok-to-test
/assign @kow3ns @liggitt

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 28, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 11, 2019
keps/sig-apps/20190318-Graduate-CronJob-to-Stable.md (review threads resolved)
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 19, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: barney-s
To complete the pull request process, please assign kow3ns
You can assign the PR to them by writing /assign @kow3ns in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dims
Member

dims commented Sep 19, 2019

@barney-s please see kubernetes/kubernetes#82659 for the list of things that need to be fixed.

@sftim
Contributor

sftim commented Mar 11, 2020

CronJobs use the TZ of the controller-manager (master).

The trouble is, someone with just API access has no idea what that TZ is. Sometimes cluster operators don't either, when they run the controller manager in a container with a UTC timezone on a host with a non-UTC timezone.

The main part of my concern is people with just API access who want to predict when their job will run. Making the API stable is a promise that the snags are sorted (or sorted enough), so I want to make sure people who are deciding on this can have regard to this particular snag. It's a thing that Kubernetes users often seem to run into.

I'd like to address that challenge before GA, by adding a timezone field to each CronJob.

and then, for the initial implementation, constrain the value for timezone to either:

  • ["UTC"]
    • easier to implement, but annoying for people who explicitly want a different timezone for their CronJobs
  • any valid entry from https://www.iana.org/time-zones
    • convenient, because it lets you mix CronJobs with a UTC zone with CronJobs that take account of local daylight-savings changes etc.

I can make a separate PR against the KEP if other people feel this is an improvement worth doing.
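
For illustration only, one possible shape for such a field on the API type. The field name and semantics below are a sketch, not a settled design:

```go
package batch

// CronJobSpec sketch: only the fields relevant to this discussion are
// shown; TimeZone is the hypothetical addition.
type CronJobSpec struct {
	// Schedule in Cron format, e.g. "0 3 * * *".
	Schedule string `json:"schedule"`

	// TimeZone names an IANA time zone (e.g. "UTC" or "Europe/London")
	// that Schedule is interpreted in. If unset, the controller
	// manager's local time zone is used, which preserves today's
	// behaviour.
	TimeZone *string `json:"timeZone,omitempty"`
}
```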

@barney-s
Contributor Author

barney-s commented Mar 11, 2020

The trouble is, someone with just API access has no idea what that TZ is. Sometimes cluster operators don't either, when they run the controller manager in a container with a UTC timezone on a host with a non-UTC timezone.

The main part of my concern is people with just API access who want to predict when their job will run. Making the API stable is a promise that the snags are sorted (or sorted enough), so I want to make sure people who are deciding on this can have regard to this particular snag. It's a thing that Kubernetes users often seem to run into.

I'd like to address that challenge before GA, by adding a timezone field to each CronJob.

I see 2 issues.

  1. Users with just API access don't know the timezone of the master.
  2. Clusters across different environments have different timezones. There is no known recommendation or conformance test defining the timezone of the master, so different distributions may behave differently.

For 1, can it be fixed by asking the administrator for the timezone of the master component?

For 2, TBH I don't know if we have a conformance recommendation for the timezone of a master. This causes the same set of CronJobs to behave differently across clusters from different providers.

I am thinking a per-CronJob timezone may be overkill. Instead it should be a cluster-level config or a recommendation.

That being said, I will reach out to some more folks.

@liggitt - would you have any thoughts on this?

@fredsted

Was asked to comment from a PR over at the main repo. After reading the KEP, my impression is that this initiative is mostly geared towards scaling, performance and monitoring, which is of course welcome. However, I'm not seeing anything that fixes the "100 missed start times" limit. Therefore I would suggest that the aforementioned PR is added to the "Fix applicable open issues" list, or at least that some attention is given to it. It's the only issue I've had with Kubernetes CronJobs so far.

@sftim
Contributor

sftim commented Mar 12, 2020

asking the administrator for the timezone of the master component

The question I'd ask is: once this is declared stable, does that make it harder to remove that requirement for an out-of-band conversation?
If it's easy to fix even after the feature is stable, without an impact on cluster operators etc., then it's OK to defer the fix.
On the other hand, if a post-stabilization fix is hard, then it should land before GA.

@barney-s
Contributor Author

@sftim please recheck; I have added a timezone field.

@sftim
Contributor

sftim commented Mar 12, 2020

@barney-s thanks, that addresses my concerns. It means that someone who wants to can write an admission controller to default to UTC or to reject CronJobs that don't specify a whitelisted timezone.
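
As a rough sketch of that idea (the helper, allow-list, and `.spec.timeZone` field referenced below are purely illustrative, assuming the proposed string field):

```go
package cronjobwebhook

import (
	"fmt"
	"time"
)

// allowedTimeZones is an illustrative allow-list a cluster operator might
// enforce via a validating admission webhook.
var allowedTimeZones = map[string]bool{
	"UTC":           true,
	"Europe/London": true,
}

// validateTimeZone sketches the check such a webhook could run against a
// hypothetical CronJob .spec.timeZone value (nil meaning "not set").
func validateTimeZone(tz *string) error {
	if tz == nil || *tz == "" {
		// A webhook could instead default this to "UTC".
		return fmt.Errorf("spec.timeZone must be set explicitly")
	}
	if !allowedTimeZones[*tz] {
		return fmt.Errorf("spec.timeZone %q is not in the cluster allow-list", *tz)
	}
	// Confirm the name is a valid IANA zone known to this system.
	if _, err := time.LoadLocation(*tz); err != nil {
		return fmt.Errorf("spec.timeZone %q is not a valid IANA zone: %v", *tz, err)
	}
	return nil
}
```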

@barney-s barney-s requested a review from wojtek-t March 13, 2020 17:32
##### Multiple workers
We also propose to have multiple workers, controlled by a flag similar to the [statefulset controller](https://github.com/kubernetes/kubernetes/blob/master/cmd/kube-controller-manager/app/apps.go#L65). The default would be set to 5, similar to [statefulset](https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/statefulset/config/v1alpha1/defaults.go#L34).
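
For illustration (not part of the KEP text itself), the knob could mirror the statefulset controller's configuration; the type, field, and function names below are assumptions:

```go
package config

// CronJobControllerConfiguration is a hypothetical analogue of the
// statefulset controller's configuration struct.
type CronJobControllerConfiguration struct {
	// ConcurrentCronJobSyncs is the number of CronJob objects allowed to
	// sync concurrently. Larger values mean more responsive CronJob
	// handling at the cost of more CPU and API-server load.
	ConcurrentCronJobSyncs int32
}

// RecommendedDefaultCronJobControllerConfiguration applies the proposed
// default of 5 workers when no value has been set.
func RecommendedDefaultCronJobControllerConfiguration(cfg *CronJobControllerConfiguration) {
	if cfg.ConcurrentCronJobSyncs == 0 {
		cfg.ConcurrentCronJobSyncs = 5
	}
}
```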

##### Handling Cron aspect
Contributor

@vllry vllry Mar 17, 2020


There is a critical problem with the current implementation that I would like to call out.

When determining if a cron should be scheduled, getMissedSchedules() is used to return the most recent missed schedule time (and a count of missed start times, in the form of extra items in the returned slice). https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/cronjob/utils.go#L92

This code uses a 3rd party module that has a Next() function for schedules, but not a Previous(). Rather than looking back with now.Previous(), the code repeatedly calls Next() until reaching the current time... or hitting a hard cap. I have explored 3 ways to remove the hard cap, ranked in decreasing order of appeal as I understand them:

  1. Implement a Previous function, by working around the 3rd party library, or by outright replacing it.
  2. Use a binary search with the Next() function, to avoid iterating over every start time in the checked time window. For any subwindow, if there is a start time in the second half of the window, the most recent missed start time is at least at that time.
  3. Remove the concept of "which start time" the job is launched on behalf of. If we only need to know that the Cron Job is due for a run (and is within its startingDeadlineSeconds), we can avoid this pattern. This seems the least ideal due to the decrease in clear status.

We are implementing option 2 at Lyft (as it's far easier to test and have fast confidence in than option 1). 100 missed start times is nowhere near long enough to support our production Cron users, given the myriad of ways that code or the CronJob/Kubernetes machinery can break.
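
For concreteness, here is a minimal sketch of option 2. This is illustrative only, not the patch we are running; it assumes the robfig/cron Schedule interface the controller already uses, and the helper name is made up:

```go
package cronjob

import (
	"time"

	"github.com/robfig/cron/v3"
)

// mostRecentMissedSchedule returns the latest schedule time in (earliest, now],
// found by binary-searching with Next() instead of stepping through every
// start time (the pattern that hits the 100-iteration cap today).
func mostRecentMissedSchedule(sched cron.Schedule, earliest, now time.Time) (time.Time, bool) {
	if sched.Next(earliest).After(now) {
		// Nothing was missed in the window.
		return time.Time{}, false
	}
	lo, hi := earliest, now
	// Invariant: at least one schedule time lies in (lo, now], and the
	// latest such time is <= hi. Assuming minute- or second-level cron
	// granularity, shrinking the window below one second pins it down.
	for hi.Sub(lo) > time.Second {
		mid := lo.Add(hi.Sub(lo) / 2)
		if sched.Next(mid).After(now) {
			hi = mid
		} else {
			lo = mid
		}
	}
	return sched.Next(lo), true
}
```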

Contributor Author


@vllry - Thanks for this feedback. Are you using a custom version of CronJob or a separate controller? Is that public?

We are interested in the myriad ways the machinery can break.

Contributor


We are patching the upstream controller (I can open an open source PR soon if you're interested, but it's not yet deployed/proven internally). We considered writing a new controller but are hoping this effort moves forward soon.

We have ~250 CronJobs at present, which entails (a) hitting a lot of infrequent edge cases, and (b) being extremely susceptible to the performance issues outlined in the KEP and my specific comment.

```
nextScheduleTime += jitter
```

### Support Timezone for cronjobs
Contributor


Suggestion: this may be better suited for a v2. This feels easy to get wrong, as a number of other commenters have suggested. It would be unfortunate to rush a design to v1 while trying to promote the other improvements.

Contributor


It would be unfortunate to rush a design to v1 while trying to promote the other improvements.

If rushing is a concern, I'd prefer to extend the beta. Add .spec.timezone along with other tidying, and try that out for a release cycle (or more).

Contributor Author


I do see a use case for this, especially with multi-cluster scenarios. Having a schedule in the local timezone or UTC (depending on the master VM/pod) would be good for most use cases. Having a deterministic schedule may be preferable for others.

But I do agree we need to get some consensus on the priority and on when it needs to be done.

Member


I would suggest moving it to a separate KEP. I agree it's potentially controversial, while the overall idea behind this KEP isn't. I don't want to have this KEP stuck for too long (it's already hanging too long).

Member

@wojtek-t wojtek-t left a comment


LGTM from the scalability perspective.

```
nextScheduleTime += jitter
```

### Support Timezone for cronjobs
Member


I would suggest moving it to a separate KEP. I agree it's potentially controversial, while the overall idea behind this KEP isn't. I don't want to have this KEP stuck for too long (it's already hanging too long).

- CronJob was introduced in Kubernetes 1.3 as ScheduledJobs
- In Kubernetes 1.8 it was renamed to CronJob and promoted to Beta

## Alternatives and Further Reading
Member


Putting on my "production readiness review" hat: you will also need to fill in the PRR questionnaire:
#1620

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 25, 2020
@riking

riking commented Jun 29, 2020

Still needs work to chase up loose ends.

@soltysh
Contributor

soltysh commented Jul 15, 2020

@barney-s are you still working on this one? @alaypatel07 and I are currently working on updating the controller; I'd like to take over this KEP, update it, and finish it, if you don't mind.

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 14, 2020
@soltysh
Contributor

soltysh commented Sep 3, 2020

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Sep 3, 2020
@krmayankk

/assign

@wojtek-t
Member

Closing in favor of #1996

@wojtek-t wojtek-t closed this Sep 22, 2020