Thursday, 2025-06-12

mossblaserHello all; jrosser helpfully spotted talk of the ubuntu/cloud-init bug 2113797 which appears to have reared its head. As a result of this I now have a couple of amphorae which are stuck in a BOOTING state (with the parent load balancer stuck in PENDING_CREATE) due to what I presume is the REVERT_ALL bug in taskflow #2043808 which we don't yet have patched in our deployment. Is there a more elegant workaround for this than manually15:02
mossblaserediting the states of the amphora and LB in the database? (Thanks for your help!)15:02
johnsommossblaser This has nothing to do with taskflow. Canonical broke the ability to configure VMs with cloud-init on Jammy. 15:04
johnsommossblaser They have already released the fix. Also, they will eventually time out and the LB will move to error. It just depends on the timeout you have set. The default upstream is very long due to test gate issues.15:05
johnsommossblaser Also, that bug is closed as the fix was released15:06
mossblaserthanks! I think what I mean is that something like ~50% of the LBs created using the bad image have entered an ERROR state and can be cleaned up (after a rather long timeout as you say) but the other 50% have remained stuck for >24 hours at this point in some cases15:07
mossblaserand log messages seem suggestive that this is down to the taskflow bug15:08
mossblaserunfortunately we don't have the fixed taskflow version deployed15:08
johnsomThat should not happen. Open a bug with an example LB ID and your API/Worker/health manager logs.15:08
johnsomOh, well... You are using a new amphora image with an expired version of Taskflow?15:09
mossblaserwe're using 2024.1 (installed by OSA) and have a CI job which rebuilds a matching amphora image on a regular basis; but unfortunately the fixed taskflow (5.9.1, if I've understood) is not included in 2024.1 (it first appears in 2024.2)15:20
mossblaserjohnsom: Given the older versions we're using is the taskflow issue a plausible cause or would you still consider this worth reporting?15:43
johnsomOk, so 2024.1 is still maintained.15:49
johnsomIt looks like that patch was not backported, which we should do and cut a stable branch release.15:49
johnsomWould you be able to pick that up?15:49
gthiemongejohnsom: I can propose the backports15:50
johnsomFor those stuck in PENDING_CREATE, check all of the workers and health managers are not actively retrying them, then set them to ERROR and failover or delete. But be careful to check before just blanket moving them to error or you may end up with inconsistencies across the services.15:52
mossblasergthiemonge: thanks! (I'm afraid I don't have capacity at the moment)15:59
mossblaserjohnsom: thanks -- and to change their states, that's a bit of database surgery, right? or is there a command available to force this?15:59
johnsomYes, it's DB. It should only VERY RARELY be done and is dangerous if you don't check the logs first, so there is no command.16:00
mossblaserthat was what I was expecting -- thanks very much :)16:01
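
The state change johnsom describes (move a load balancer stuck in PENDING_CREATE to ERROR, one at a time, after checking the logs) might look roughly like the sketch below. It is only an illustration under stated assumptions: the `load_balancer` table and `provisioning_status` column names are assumed and should be verified against the deployed Octavia schema, an in-memory SQLite database stands in for the real one, and `mark_stuck_lb_error` and `demo` are hypothetical helper names. The `WHERE` clause matches on the current status so that a load balancer a worker has already moved on is left untouched.

```python
import sqlite3


def mark_stuck_lb_error(conn, lb_id):
    """Move one load balancer from PENDING_CREATE to ERROR.

    Table and column names are assumptions; check them against the
    real schema. Returns True only if exactly one row was changed.
    """
    cur = conn.execute(
        "UPDATE load_balancer SET provisioning_status = 'ERROR' "
        "WHERE id = ? AND provisioning_status = 'PENDING_CREATE'",
        (lb_id,),
    )
    conn.commit()
    return cur.rowcount == 1


def demo():
    # In-memory stand-in for the real database, with made-up IDs.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE load_balancer "
        "(id TEXT PRIMARY KEY, provisioning_status TEXT)"
    )
    conn.executemany(
        "INSERT INTO load_balancer VALUES (?, ?)",
        [("lb-stuck", "PENDING_CREATE"), ("lb-ok", "ACTIVE")],
    )
    changed = mark_stuck_lb_error(conn, "lb-stuck")  # stuck LB is updated
    skipped = mark_stuck_lb_error(conn, "lb-ok")     # ACTIVE LB is not touched
    rows = dict(conn.execute("SELECT id, provisioning_status FROM load_balancer"))
    return changed, skipped, rows
```

Updating by a single ID, rather than every row in PENDING_CREATE, follows the warning above against blanket changes. The sketch does not perform the prior check that no worker or health manager is still retrying the load balancer, nor the failover or delete that follows.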
