Tuesday, 2021-07-06

*** bhavikdbavishi1 is now known as bhavikdbavishi04:00
opendevreviewFelix Edel proposed zuul/zuul master: Move parent provider determination to pipeline manager  https://review.opendev.org/c/zuul/zuul/+/79763105:22
opendevreviewSimon Westphahl proposed zuul/zuul master: Move tenant validation to separate method  https://review.opendev.org/c/zuul/zuul/+/79946305:24
opendevreviewSimon Westphahl proposed zuul/zuul master: Combine full and smart reconfiguration events  https://review.opendev.org/c/zuul/zuul/+/79946405:24
*** tibeer[m] is now known as TimBeermann[m]06:29
*** jpena|off is now known as jpena06:58
*** rpittau|afk is now known as rpittau07:30
opendevreviewBenjamin Schanzel proposed zuul/zuul master: Add tenant name on NodeRequests for Nodepool  https://review.opendev.org/c/zuul/zuul/+/78868007:41
opendevreviewFelix Edel proposed zuul/zuul master: Move parent provider determination to pipeline manager  https://review.opendev.org/c/zuul/zuul/+/79763108:31
TimBeermann[m]y08:39
TimBeermann[m]y08:39
TimBeermann[m]y08:39
opendevreviewSimon Westphahl proposed zuul/zuul master: Refactor pipeline processing in run handler  https://review.opendev.org/c/zuul/zuul/+/79598509:10
opendevreviewSimon Westphahl proposed zuul/zuul master: Tenant, pipeline and queue locks in Zookeeper  https://review.opendev.org/c/zuul/zuul/+/79599309:10
opendevreviewMerged zuul/zuul master: Don't process build state watches w/o data  https://review.opendev.org/c/zuul/zuul/+/79946009:44
*** bhagyashris_ is now known as bhagyashris|ruck09:50
avass[m]I have a feeling `zuul-executor graceful` may not be working like it used to, at least it looks like it stopped working after a recent update and executors exit directly10:45
avass[m]only one out of three executors logged a graceful exit and we got a couple of lost jobs and some jobs with error status10:54
boha[m]<avass[m] "boha: i think we may have a patc"> Have you found time to look for a windows-friendly version of remove-build-sshkey?11:17
avass[m]boha: apparently we are using this: https://review.opendev.org/c/zuul/zuul-jobs/+/78066211:20
avass[m]:)11:20
avass[m]along with a vendored version of win_lineinfile to make sure newlines aren't removed at the end of the file11:21
boha[m]<avass[m] "boha: apparently we are using th"> Thanks, I'll give it a try. It seems I should search the zuul-jobs reviews for hidden gems more often ;-) 11:31
*** jpena is now known as jpena|lunch11:31
avass[m]boha: we try to push changes upstream when we go time :)11:33
tobiash[m]avass: regarding graceful, which version are you running?12:08
avass[m]tobiash: 4.6.0 i think12:18
tobiash[m]so before the execitor-api landed12:18
avass[m]Yeah, we haven't upgraded since the security release12:19
avass[m]I'm not 100% it's not kubernetes doing something strange yet however12:20
tobiash[m]k, we as well, thanks for the info, so we can be more cautious for now during operations12:20
*** jpena|lunch is now known as jpena12:37
opendevreviewSimon Westphahl proposed zuul/zuul master: Store tenant layout state in Zookeeper  https://review.opendev.org/c/zuul/zuul/+/77146113:49
corvusavass, tobiash: i'll look into graceful today14:21
avass[m]corvus: cool thanks14:28
tobiash[m]in principle graceful seems to still work for us but haven't checked for intermittend issues with it, so there might be some race condition14:47
fungiin opendev we just end up abruptly killing the executors... our nodes are quite stateless so we just end up with the running builds getting retried14:49
fungiso if graceful stop broke at some point, we probably wouldn't have noticed14:49
clarkbfungi: there was an issue with zuul executor shutdowns that we noticed though where instead of going into a retry state when executors are stopped the jobs went straight to error14:53
fungiyeah, that seemed to be related to being in between the gearman to zookeeper results migration14:54
corvusshould be back to retry now that we have the executor api in place14:54
avass[m]It could also be that we've changed something in our deployment that I've missed14:54
avass[m]We'd rather have a rolling upgrade that takes a while than retrying our jobs :)14:56
fungiyep, completely understand. they're clearly very different from our jobs14:56
avass[m]taking a closer look I think it can be on our end. We're installing a package before starting the executor and doing that through bash so maybe dumb-init and bash handles sigterm differently (or maybe something changed the sigterm behaviour for zuul executor?)15:06
corvusavass: what's your container command line (approximately)?15:11
avass[m]bash -c "pip install... && zuul-executor -f -d", as an argument to dumb-init (since we're not changing entrypoint)15:12
avass[m]I'm not sure if graceful shutdown stopped working before or after that was added however15:12
avass[m]oh, "/bin/sh -c" actually, not bash15:13
avass[m]aha, the most recent change that could have altered the behaviour was to change `command:` to `args:`, so maybe it started happening at that point since we used to overwrite dumb-init and /bin7sh was the entrypoint for a while. Though then we should have been seeing this behaviour since May. *shrug*15:20
*** marios is now known as marios|out16:04
*** jpena is now known as jpena|off16:14
*** rpittau is now known as rpittau|afk16:39
corvusavass: graceful under 4.6.0 appears to work as expected with my simple setup (even if i override the entrypoint or command when running under docker-compose).  it's not an exact replica, but i think it's pretty close.  i'm about to restest with master.16:52
avass[m]corvus: yeah I'm expecting it to be what i mentioned above and that we somehow weren't affected by it until now. I got a change that configures dumb-init to drop SIGTERM that I'm gonna test in a bit16:56
corvusavass: what's sending sigterm?17:02
avass[m]corvus: Kubernetes, we have a prestop hook that runs zuul-executor graceful17:03
corvusavass: do you need to adjust the timeout so k8s doesn't send sigterm?17:03
corvusoh, that's the signal that's before the grace period17:04
avass[m]Yeah17:04
avass[m]So that why we're dropping that and letting kubernetes sigkill if the graceful shutdown takes too long17:05
corvusavass: but sigterm should have always done a hard stop; how could it have behaved differently before?17:06
corvus(apparently if you run the executor in daemon mode, we don't set up the sigterm handler; but if it's run in foreground, sigterm=stop)17:07
avass[m]I'm guessing that /bin/sh was ignoring it, while dumb-init has a default behavior of stoping everything17:07
avass[m]That's my current theory that i have not yet tested however :)17:08
avass[m]But i do know that it used to work17:08
corvustobiash, tristanC: do either of you use graceful when restarting executors?  it looks to me like the only way to have that work in k8s is to do what avass is talking about (blocking sigterm in PID 1).17:12
corvusavass: i'm resetting my test now and will experiment17:12
avass[m]of course it also works if you actually wait for zuul-executor to exit in the prestop hook (which we're not doing)17:13
corvusoh that's an interesting approach too17:14
corvusavass: i confirmed that overriding the entrypoint to '/bin/sh -c' means that 'kill -SIGTERM 1' doesn't cause the executor to stop  (but that does work with the default entrypoint)17:16
corvusavass: so it sounds like you have 2 approaches for an immediate fix (block sigterm in pid 1, or wait in the prestop hook).  but maybe we should think about whether sigterm either should default to 'graceful' or that should be an option17:17
corvusi think sigterm=stop is desirable for opendev though (where restarting builds is not considered a big deal).  that means "docker-compose down" (or stop) matches the desired opendev behavior.17:19
corvusmaybe the best thing would be an environment variable, that way it's easy for anyone running containers to change the behavior to suit17:20
avass[m]corvus: or just add a `zuul-executor graceful -w` so the command waits until the executor stops17:20
corvusavass: yeah -- is that considered good or bad form in k8s?17:21
avass[m]seeing as kubernetes sends a sigterm after the prestop hook it seems like that's the way to do graceful shutdowns :)17:22
avass[m]of course the second option is like you say to change the behaviour of SIGTERM17:23
corvusi think i'm leaning slightly in that direction because it works the same way for k8s and docker-compose17:25
*** mgoddard- is now known as mgoddard17:26
avass[m]there's also `docker-compose down -t 0` if the container should shutdown directly17:27
corvusavass: yeah, but that's a sigkill as opposed to a zuul-executor stop17:27
corvuswhich are close, but not quite the same :)17:28
corvuszuul-executor stop being "immediately stop all running jobs and shutdown cleanly" and sigkill being "terminate the process and leave everything hanging in zk for the scheduler to clean up"17:29
avass[m]in that case I'm leaning towards `zuul-executor graceful -w` to make it easy, I'll configure our deployment to ignore SIGTERM for now17:29
corvusavass: sorry, you don't like my environment variable idea?17:29
avass[m]I just think the behaviour is more explicit if it's part of the command.17:31
avass[m]where an environment varialbe that switches SIGTERM to graceful shutdown could be set in a bunch of places17:32
corvusavass: with my suggestion, we'd also remove the need for you to set a prestop hook at all17:32
corvusreading the k8s docs, it doesn't necessarily sound like having a prestop wait is encouraged.  it doesn't look like it would be a problem, but they do say "Users should make their hook handlers as lightweight as possible. There are cases, however, when long running commands make sense, such as when saving state prior to stopping a Container."17:33
corvusit seems like just setting all of the prestop stuff aside and then setting ZUULEXECUTORSTOP_GRACEFUL=1 in the k8s yaml / docker-compose.yaml would be clear and meet everyone's needs17:34
corvusgrr17:34
tristanCcorvus: no we are not, sf-config relies on the default signal sent by systemd17:34
corvusZUUL_EXECUTOR_STOP_GRACEFUL=117:34
corvustristanC: so you terminate jobs immediately, like opendev?17:35
avass[m]yeah I'm not opposed to it, I just prefer `zuul-executor graceful -w` :)17:35
avass[m]both are fine17:35
corvusavass: ok, if you don't have a strong objection, i think i'll keep going with the env variable approach; i like it when k8s / non-k8s can use the same code paths17:36
avass[m]right, because there's no way to run a command before stopping in docker-compose17:37
corvusyeah; of course, we could put it in our playbooks; but this way it sort of means you can set the default for your environment17:39
tristanCcorvus: well we only restart executor along with the scheduler, to perform upgrade.17:42
tristanCthat being said, it would be nice to be able to let the services restart when they are idle17:42
fungisome day, it would be nice to elastically provision and add/remove executors on the fly as demand rises and falls17:44
fungi(and mergers)17:44
fungidoable today with some external orchestration17:45
corvuswe're close to being able to do that with the operator17:48
avass[m]It's very doable in kubernetes if you're fine with scaling on dumb metrics like cpu/memory etc17:48
corvussoon we'll have prometheus endpoints with queue size and number of builds17:51
avass[m]and suddenly you need an engineer to tune your PID controller that scales your infrastructure that supports your other engineers that are working with tuning PID controllers ;)17:54
mnaserhttps://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/#autoscaling-on-multiple-metrics-and-custom-metrics is a fun read :)17:54
fungiavass[m]: yeah, we actually have some values emitted via statsd which would probably provide a good indication... like if you have too many or too few executors accepting jobs for a protracted period17:55
opendevreviewJames E. Blair proposed zuul/zuul master: Add graceful stop environment variable  https://review.opendev.org/c/zuul/zuul/+/79970118:04
corvusavass, tobiash, tristanC: ^ how's that look?18:04
corvusoh i forgot to git add the reno18:05
opendevreviewJames E. Blair proposed zuul/zuul master: Add graceful stop environment variable  https://review.opendev.org/c/zuul/zuul/+/79970118:05
avass[m]maybe do a case insensitive comparizon?18:06
corvusshould we call that ZUUL_EXECUTOR_SIGTERM_GRACEFUL instead?18:06
corvusavass: ++18:06
avass[m]otherwise lgtm18:06
avass[m]does that affect both `zuul-executor stop` and SIGTERM?18:06
opendevreviewJames E. Blair proposed zuul/zuul master: Add graceful stop environment variable  https://review.opendev.org/c/zuul/zuul/+/79970118:07
corvusno just sigterm18:07
opendevreviewJames E. Blair proposed zuul/zuul master: Add graceful stop environment variable  https://review.opendev.org/c/zuul/zuul/+/79970118:08
avass[m]`ZUUL_EXECUTOR_SIGTERM_GRACEFUL` sounds more specific in that case18:08
avass[m]:)18:08
corvusi think i talked myself into sigterm; i could see that being confusing18:08
tobiash[m]cool, so far we use graceful in a pre stop hook18:11
tobiash[m]Not having the need for the hook would nake the setup simpler18:12
corvustobiash: do you have the hook wait, or do you disable sigterm (maybe by overriding entrypoint to /bin/sh)?18:12
tobiash[m]I'd have to look, but I think the hook waits18:12
corvusack.  not too important, was mostly asking to make sure i understand and wasn't missing anything.  :)18:14
tobiash[m]corvus: my pre-stop hook is this: /bin/sh -c "zuul-executor graceful && sleep 86400"18:16
tobiash[m]actually just a hack, tell the executor to shutdown and wait for a day (which is our grace period)18:17
tobiash[m]as soon as the executor terminated itself the hook will be terminated as well18:17
avass[m]that's one way to do it :)18:20
corvusmakes sense18:20
clarkbtalking out loud here but why not use a different signal for graceful stops?18:42
clarkbI think that is how it worked once upon a time?18:42
fungii think we used to trigger that on hup or maybe usr118:43
clarkbya18:43
avass[m]kubernetes always sends SIGTERM and I don't think that's configurable, though it would be possible to rewrite it with dumb-init18:43
fungibut thought the signal handler was intentionally removed in favor of the rpc method18:43
fungiyeah, signal translation in the init sounds like voodoo, hopefully we won't go there ;)18:44
clarkbavass[m]: ah that is too bad. Even docker compose can change the signal :)18:44
fungii figure kubernetes was written for dataless applications. the entire idea of state much less graceful maintenance thereof during shutdown seems sorta bolted on to begin with18:45
SpamapShttps://kubernetes.io/blog/2015/04/borg-predecessor-to-kubernetes/ <-- this gives a good idea of what the genesis of Kubernetes was. But they sort of side-step whether or not state management was considered.20:09
SpamapSBut really, sending term, waiting, and then kill, is a pretty established pattern for supervisor processes and not a hard interface to fit into. Most app servers suck at being pid 1 anyway.20:11
SpamapSLikewise, you wouldn't run your DB server with init=/usr/bin/mysqld20:12
SpamapSAlso just reading a little into the zuul discussion...once zuul's state is always in ZK (How far along is that?) a graceful stop should be a lot shorter, no?20:18
SpamapSI guess executors are special in that they have a lot of stuff under them to finish running.20:19
SpamapSalso was there any discussion of this: https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/  ?20:20
avass[m]spamaps: yeah, we're already using preStop hooks to gracefully shutdown, but apparently we weren't managing the SIGKILL that is sent after preStop correctly20:21
SpamapSAhh, if that preStop is already expected to take a day, wouldn't a kill deadline of a day be appropriate?20:22
avass[m]it would be cool if an executor could gracefully shutdown between playbooks instead of needing to complete the entire job, though that would take a bit more work20:22
SpamapSterminationGracePeriodSeconds: 86400, or something, no?20:23
avass[m]spamaps: kubernetes always sigterms directly after the preStop hook is executed, so the only way to avoid stopping is to make the preStop wait until the process exits or ignore the SIGTERM so kubernetes waits terminationGracePeriodSeconds until it sends a SIGKILL instead20:24
avass[m]or rather, directly after the preStop has completed20:24
SpamapSMy understanding is that it always sends a TERM if terminationGracePeriodSeconds hasn't been reached while waiting for preStop20:25
SpamapSso if you make the preStop wait long enough, you'll never get that TERM.20:25
avass[m]yeah that was the issue, we only ran `zuul-executor graceful` which doesn't block. but that used to work since our PID 1 was /bin/sh that ignored SIGTERM20:26
SpamapSAHHHH20:26
SpamapSgood ol /bin/sh, the worst/best init. ;)20:26
avass[m]:)20:26
* fungi still sometimes passes init=/bin/sh to the kernel under dire circumstances20:27
SpamapSI'm still of the opinion that expecting python not written to be pid==1, to behave like pid==1, is harder than simply running dumb-init or supervisord.20:27
avass[m]I mean it did work like we wanted it to when we had /bin/sh as PID 1 ;)20:27
SpamapS/bin/sh makes dumb-init looks smart.20:28
avass[m]now with corvus patch we can instead just remove the preStop hook and just set `ZUUL_EXECUTOR_SIGTERM_GRACEFUL=1`20:28
SpamapSYeah.. it's probably OK.. but I'm still nervous about making zuul-executor a pid==1. Maybe it's just me. I'm not running any zuul's in prod, so I am just projecting at this point.20:31
avass[m]Oh no, dumb-init will still be pid 120:32
SpamapSAh it was just passing it along. Which is appropriate.20:34
avass[m]The reason it was like that is because `command` in kubernetes is the entrypoint in docker. So instead of setting /bin/sh as an argument to dumb-init we had overridden the entrypoint. Once we fixed that our graceful shutdown broke since suddenly we stopped ignoring SIGTERM. 20:35
SpamapSIs it crazy to think that rather than a config option, zuul-executor look for a graceful shutdown, and if it's in progress, ignore TERM?20:35
fungithat doesn't sound entirely unreasonable20:36
SpamapSI can see that causing an issue if a server is expecting TERM to interrupt gracefuls.20:37
fungiisn't that what sigkill is for though?20:38
fungiterm is "please terminate now" while kill is "you're done, bye"20:38
fungigranted i still like hangup for graceful, i'm not really that invested in bsdisms any longer20:39
SpamapSYeah, that's what I would expect as well. However, the behavior has been "KTHXBAI" and it would change to "HMM? terminate? what? Nothing to terminate here."20:39
opendevreviewJames E. Blair proposed zuul/zuul-registry master: Add missing size to image configs  https://review.opendev.org/c/zuul/zuul-registry/+/79972623:44

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!