Monday, 2022-10-24

-@gerrit:opendev.org- Zuul merged on behalf of Benjamin Schanzel: [zuul/nodepool] 853168: Add config option to limit ephemeral storage on K8s Pod labels https://review.opendev.org/c/zuul/nodepool/+/85316806:49
-@gerrit:opendev.org- Zuul merged on behalf of Benjamin Schanzel: [zuul/nodepool] 851988: Implement cleanup of leaked resources in k8s driver https://review.opendev.org/c/zuul/nodepool/+/85198811:21
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/nodepool] 862500: Add debug logs to the static driver https://review.opendev.org/c/zuul/nodepool/+/86250015:41
@clarkb:matrix.orgcorvus: ^ and https://review.opendev.org/c/zuul/nodepool/+/862414/ are probably both good to go. I'll WIP the other change until we can catch it failing with the better debug logs15:41
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/nodepool] 862500: Add debug logs to the static driver https://review.opendev.org/c/zuul/nodepool/+/86250017:56
@clarkb:matrix.orgslight improvement to the logs since the check jobs were already failing17:56
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 860989: Remove Ansible 5 https://review.opendev.org/c/zuul/zuul/+/86098918:25
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/nodepool] 862522: WIP: openstack statemachine https://review.opendev.org/c/zuul/nodepool/+/86252219:44
@clarkb:matrix.orgcorvus: I think I understand the static driver issue a lot better now (the extra debug logging did help). We're stopping the old provider manager after creating the new provider manager. This allows the old provider manager's syncNodeCount() to be called from the cleanup routines for the old manager after the new manager is created. If this happens and we idle then nothing triggers syncNodeCount() to reconcile the resulting delta20:25
@clarkb:matrix.orgI think an easy fix here is to stop provider managers before starting them. Does anyone know why we don't do this already?20:25
@jim:acmegating.comClark: because in a cloud, it would appear to be a huge outage20:28
@jim:acmegating.comit's technically possible for a provider to take hours to stop.20:29
@jim:acmegating.comthat's worst-case, but even under typical cases, that could be *many* minutes for a cloud like rax20:29
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/nodepool] 862526: Stop provider managers before starting new ones https://review.opendev.org/c/zuul/nodepool/+/86252620:31
@clarkb:matrix.orgThats the naive swap of ordering.20:32
@clarkb:matrix.orgif nothing else it will illustrate the problem I hope. It does seem like maybe the static driver needs some more global locking or perhaps a less forceful stop() call we can call first?20:33
@clarkb:matrix.orgmaybe it is ok to stop the periodic tasks, but let existing boots complete?20:33
@jim:acmegating.comClark: is that calling this method? https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/static/provider.py#L40920:37
@clarkb:matrix.orgoh wow we don't even stop the driver with the stop() method?20:38
@clarkb:matrix.orgThe problem is https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/static/provider.py#L436-L449 is being called via periodic tasks on the old manager after the new manager's start() method has been called. This effectively overwrites the info in zk for the new config immediately after it is written20:39
@clarkb:matrix.orgyou can see that in this log here: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_904/862500/1/check/nodepool-tox-py38/9042733/testr_results.html starting roughly with `2022-10-24 17:28:09,844 DEBUG [nodepool.driver.static.StaticNodeProvider] Starting static provider static-provider`20:40
@clarkb:matrix.orgcorvus: reading more it appears that the cleanup threads lookup the provider managers out of the config. But we don't swap the configs out until after starting then "stopping" the old managers20:52
@clarkb:matrix.orgmaybe we can put a lock around that?20:53
@jim:acmegating.comClark: that sounds good20:57
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/nodepool] 862526: Lock around config updates and provider manager usage https://review.opendev.org/c/zuul/nodepool/+/86252621:19
@clarkb:matrix.orgcorvus: ^ something like that maybe? I'm not entirely sure I've locked around all the necessary areas. I used getProviderManager() calls as my clues.21:19
@jim:acmegating.comClark: is assignhandlers necessary?21:26
@jim:acmegating.comClark: i think the assignhandlers lock is going to cause similar problems to the reordering.  it still feels like the tail wagging the dog.  it would be great if the fix for this could be in the static driver.21:32
@jim:acmegating.comClark: i think i like your idea of a new stop method.  that would contain the impact to just the static driver.21:39
@clarkb:matrix.orgI think part of the problem here is due to the assumption that state is external for things like resource cleanup21:48
@clarkb:matrix.orgwhich means that we can safely invoke cleanups against the old or new config and there isn't a difference21:48
@jim:acmegating.comyep, it's absolutely the case that the static driver does not fit the model21:49
@jim:acmegating.comwhich is why i want to avoid changing too much of nodepool to accommodate it.  i'd rather have an extra stop method that no other driver has, than to shutdown a busy system for every reconfiguration (which happens continuously)21:51
@clarkb:matrix.orgya that seems fine. Another thought I've had is wondering if we can separate the state from the reconfiguration and let it persist somehow, but I think that is basically wha we have now and its the frame of reference for operating on the staet that is confused21:52
@jim:acmegating.comyep21:53
@clarkb:matrix.orgalright let me see about adding a provider.idle() call we can do21:54
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/nodepool] 862526: Add idle state to driver providers https://review.opendev.org/c/zuul/nodepool/+/86252622:10
@clarkb:matrix.orgI think that does it generically enough that if other drivers needed to have a similar hook point they can take advantage of it while still keeping the delta here to a minimum to reduce the chances of holding up slow drivers22:11
@jim:acmegating.comwell, it's not just slow drivers i'm worried about, it's busy ones :)22:15
@clarkb:matrix.orgheh there is abug in my change. Yay for testing. But also corvus do you know if TestDriver in the testsuite is used for anything more?22:23
@jim:acmegating.comClark: i'm not sure that entire driver is used for anything22:27
@clarkb:matrix.orgya I'm looking at it and I can't see where it gets loaded. I'll push a followup change to delete it and see if it breaks the tests22:27
@jim:acmegating.comClark: oh i think i know22:28
@jim:acmegating.comClark: it's a minimalist driver for interactive testing of the launcher itself.22:28
@jim:acmegating.comClark: let's keep it and just add the method22:28
@clarkb:matrix.orgack22:29
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/nodepool] 862526: Add idle state to driver providers https://review.opendev.org/c/zuul/nodepool/+/86252622:30
@clarkb:matrix.orgI did eventually find some configs that set `driver: test` so it is used22:31

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!