Thursday, 2025-07-31

corvusthe web ui needs an update to deal with that: https://review.opendev.org/956205 00:00
corvussomething strange happened with some requests after restarting with the ready node change.  some requests selected sjc3 even though they couldn't use the ready nodes and sjc didn't have any excess capacity.  i don't understand why that would have happened.00:52
corvusto address the immediate issue, i wrote https://review.opendev.org/956206 and then to add a bit more debugging to try to figure out if there's something more fundamental wrong i wrote https://review.opendev.org/956207 00:52
corvusbecause i think that requests like that could get permanently stuck (at least until the ready nodes are cleared out), i have live-patched an approximation of those changes into the launchers.  we'll definitely want to get those (or something like it) merged soon.00:54
corvusthat does appear to have resolved the problem from what i can see00:54
opendevreviewSimon Westphahl proposed zuul/zuul-jobs master: Add default for build diskimage image name  https://review.opendev.org/c/zuul/zuul-jobs/+/956219 08:44
opendevreviewJeremy Stanley proposed opendev/system-config master: Replace eavesdrop01 with eavesdrop02  https://review.opendev.org/c/opendev/system-config/+/956122 14:22
fungicorvus: looking at https://grafana.opendev.org/d/21a6e53ea4/zuul-status it seems like the launchers should be booting more nodes but aren't. i'm guessing we have some hidden quota pressure not apparent from statsd, but i'm going to check for errors in the debug logs first15:11
corvusi think they filled their disk15:12
fungid'oh!15:12
fungiyes, rootfs at 100%15:12
corvusi'll clean that out15:12
fungi8.5gb of zuul logs on zl0115:13
corvusthat's after i deleted logs15:13
fungioh wow15:13
fungithe rootfs is only syslog is not tiny either15:14
fungis/the rootfs is only //15:14
fungi24gb in total for /var/log15:15
fungion a 38gb rootfs15:15
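For scale, 24gb of logs on a 38gb rootfs is roughly 63% of the filesystem. A minimal sketch of how that share could be measured from Python, assuming nothing about the launcher hosts beyond a readable directory tree; the function names are hypothetical and only standard-library calls are used:

```python
import os
import shutil


def directory_bytes(path):
    """Sum the sizes of all regular files below path (symlinks not followed)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                # Files can vanish while logs are being rotated; skip them.
                pass
    return total


def share_of_filesystem(path):
    """Fraction of the containing filesystem's total size used by path."""
    usage = shutil.disk_usage(path)
    return directory_bytes(path) / usage.total
```

Calling `share_of_filesystem("/var/log")` on a host would give a number comparable to the 24/38 figure above, though apparent file sizes can differ slightly from what `du` reports.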
fungiyeah, syslog is full of zuul-launcher tracebacks15:16
fungione random example was a quota exceeded for rackspace flex sjc315:18
fungilooks like that's raising a ForbiddenException in the openstack sdk so the full traceback gets logged15:19
corvuswe don't need that going to syslog. any idea why it does?15:19
fungipodman configuration?15:19
fungithough i'm not immediately seeing it set in the compose file15:19
clarkbwe don't have an explicit journald log config in the docker compose file15:20
corvusif you or someone else feel like looking into improving that, that would be great15:20
clarkbbut ya maybe podman by default emits to syslog/journald?15:20
fungiyeah, i think rsyslogd is copying entries to there15:20
corvusi'm going to delete some logs then restart the containers on the image build for change 956207 15:20
fungithanks corvus!15:20
fungiclarkb: yeah, the journal is relatively huge too, so it's likely copying to both15:21
clarkbthe easiest solution may be to update the zuul logging config to stop writing to stdout/stderr15:21
fungiso odds are we have 3x copies of everything the launchers are logging15:21
clarkbsince that would be where podman collects it to forward to journald/syslog15:21
corvusoh is it only tracebacks showing up in syslog?15:22
fungii think so15:23
corvusoh i think it may be any error message15:23
fungiand messages about rsyslogd being overwhelmed/unable to write15:23
clarkblogger_root writes to stdout. The explicit child loggers write to the log files (debug and non debug)15:24
clarkbbut we're also propagating the logs up to the root so everything goes to stdout15:24
corvusmaybe we can stop propagation of any messages we are explicitly handling, and still allow unhandled errors to go to stdout?15:24
fungii'm not sure if there's more errors aside from tracebacks, i was looking at a slice of the logs from back before the disk filled up and tracebacks were what i got15:24
clarkbcorvus: that seems reasonable I can work on a change for that15:24
fungicould easily be other output landing there too15:24
corvusyeah if they're coming from stdout, then i think it's any error message15:25
corvus2025-07-31 15:26:32,565 ERROR zuul.Launcher:   openstack.exceptions.HttpException: HttpException: 500: Server Error for url: https://compute.public.mtl1.vexxhost.net/v2.1/86bbbcfa8ad043109d2d7af530225c72/servers/a2e228ac-0bbd-4295-aa4a-fa849766b144, Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.15:27
clarkbcorvus: fungi actually there are only zuul, gerrit, and gear loggers. I think we may also need to capture and not propagate openstack?15:27
clarkblet me see if I can find the old nodepool logging config15:27
corvusclarkb: i'm not seeing openstack.* logs on stdout15:28
clarkbcorvus: ok maybe we're reraising under zuul so everything is qualified under zuul? what about requests and kazoo? those are the other two nodepool had15:29
corvushaven't seen those yet.  i did see paramiko though15:30
clarkback. Do you think we should add them defensively or wait for it to be a problem first?15:30
corvuslet's only add what we know we need so we don't copy nodepool cruft15:31
corvusthat vexxhost error may be a big part of this: that's being raised inside of a threadpoolexecutor, and i think something about how we're handling the error is causing an infinitely-growing traceback everytime it hits an exception handler15:31
corvusnow that i've restarted the launchers, i'm going to try to figure out how to handle that better15:32
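One way a traceback can keep growing in Python, offered only as a possible illustration and not as a diagnosis of the launcher code: a `Future` stores a single exception object, and every `result()` call re-raises that same object, which adds more frames to its `__traceback__`. The sketch below uses a hypothetical `failing_task` as a stand-in for the failing cloud API call.

```python
import traceback
from concurrent.futures import ThreadPoolExecutor


def failing_task():
    # Hypothetical stand-in for an API call that fails in a worker thread.
    raise RuntimeError("500: Server Error")


def traceback_depths(attempts=4):
    """Re-raise one stored exception several times and record its depth."""
    depths = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(failing_task)
        for _ in range(attempts):
            try:
                # result() raises the same exception object on every call.
                future.result()
            except RuntimeError as exc:
                frames = traceback.extract_tb(exc.__traceback__)
                depths.append(len(frames))
    return depths
```

If the depths returned by `traceback_depths()` increase from one attempt to the next, then logging the exception with its traceback after each retry would emit a longer entry every time.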
fungiyeah, the log entries i saw in random samples were all for zuul.Launcher15:34
opendevreviewClark Boylan proposed opendev/system-config master: Don't propagate zuul-launcher logs written to disk  https://review.opendev.org/c/opendev/system-config/+/956258 15:36
clarkbwe may want to do similar with other zuul services but this seems like a decent starting point. I wasn't sure what level to log paramiko at so I did warning15:36
fungii'm checking to see if i find anything else ending up in syslog, there's definitely stuff from paramiko.transport15:37
clarkbfungi: what log levels are the paramiko.transport logs? and if lower than warning do you think we need to capture the messages?15:38
fungierror15:38
fungiso far15:38
clarkbok so warning is probably appropriate15:38
fungithough i do see zuul.Launcher warnings landing in syslog too, not just error15:38
clarkbfungi: zuul is being captured at a debug level15:38
fungiWARNING keystoneauth.session: Failure: Unable to establish connection to ...15:39
clarkboh I see the console (stdout) logger is warning and above15:40
clarkbso ya I think warning is safe to capture for everything that was propagating15:41
fungithat's all i'm finding since the cleanup and restart15:41
clarkblet me add keystoneauth to this change15:41
opendevreviewClark Boylan proposed opendev/system-config master: Don't propagate zuul-launcher logs written to disk  https://review.opendev.org/c/opendev/system-config/+/956258 15:42
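The propagation behaviour discussed above can be sketched with the standard `logging` module. This is an illustration under stated assumptions, not the contents of change 956258: in-memory streams stand in for stdout and for the on-disk log file, and the `demo` helper is hypothetical. The logger names (`zuul`, `paramiko`, `keystoneauth`) are the ones mentioned in the discussion.

```python
import io
import logging

HANDLED = ("zuul", "paramiko", "keystoneauth")


def demo(propagate):
    """Emit two errors and return (console_text, logfile_text)."""
    console = io.StringIO()   # stands in for stdout (collected by podman)
    logfile = io.StringIO()   # stands in for the launcher's log file

    root = logging.getLogger()
    console_handler = logging.StreamHandler(console)
    console_handler.setLevel(logging.WARNING)  # console is warning and above
    root.addHandler(console_handler)

    file_handler = logging.StreamHandler(logfile)
    for name in HANDLED:
        logger = logging.getLogger(name)
        logger.setLevel(logging.DEBUG if name == "zuul" else logging.WARNING)
        logger.addHandler(file_handler)
        # With propagate=False, records handled here never reach the root.
        logger.propagate = propagate

    try:
        logging.getLogger("zuul.Launcher").error("quota exceeded")
        # A logger with no explicit handler still falls through to the root.
        logging.getLogger("some.other.lib").error("unhandled")
    finally:
        # Undo the changes so the demo can be run repeatedly.
        root.removeHandler(console_handler)
        for name in HANDLED:
            logger = logging.getLogger(name)
            logger.removeHandler(file_handler)
            logger.propagate = True
    return console.getvalue(), logfile.getvalue()
```

With `propagate=True` the `zuul.Launcher` error lands in both streams; with `propagate=False` it lands only in the file stream, while the message from the unconfigured logger still reaches the console. That matches the goal stated earlier of keeping unhandled errors visible on stdout.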
fungiprior to the cleanup, it's hard to mine things from the logs due to tracebacks being multi-line and not having any obvious prefix to filter on15:43
clarkbfungi: the tracebacks would include the python file paths?15:46
fungiand lines of code, and separate lines of visual underlining15:47
corvusremote:   https://review.opendev.org/c/zuul/zuul/+/956260 Launcher: handle delete state machine errors [NEW]        15:47
fungieach step in the trace has a line for the file, a line for the code, and a line for the underlining15:47
clarkbfungi: ya so you'd do something like grep /path/to/code -A 2 or something15:48
fungiexcept i was trying to filter them out, not include them15:48
fungiso i want to drop the subsequent 2 lines whenever the line matches15:49
fungithere's probably a grep option for that, just need to revisit the manpage15:49
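A small Python sketch of the filter described above: drop each matching line together with the lines that follow it. It assumes every traceback frame occupies exactly three lines (file, code, underlining) as described; the function name and the sample pattern are hypothetical.

```python
import re


def drop_with_context(lines, pattern, after=2):
    """Yield lines, skipping each match and the `after` lines following it."""
    regex = re.compile(pattern)
    skip = 0
    for line in lines:
        if regex.search(line):
            # A new match restarts the skip window.
            skip = after
            continue
        if skip > 0:
            skip -= 1
            continue
        yield line


sample = [
    "ERROR zuul.Launcher: Error in state machine",
    '  File "/path/to/code/launcher.py", line 10, in run',
    "    self.delete(node)",
    "    ^^^^^^^^^^^^^^^^^",
    "WARNING keystoneauth.session: Failure",
]
kept = list(drop_with_context(sample, r'^\s+File "'))
```

Here `kept` retains only the first and last sample lines. Frames that span a different number of lines would need a different `after` value or a smarter matcher.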
clarkbcorvus: re 956260 does that match nodepool's cleanup approach to failures to delete? I know that if things failed to delete at some point we'd create a new znode to represent it then try again. Was that in the cleanup phase and not the initial delete?15:50
clarkbcorvus: mostly wondering because I think the floating IP cleanup is still not working for that one floating ip in sjc3 and I don't want to rely on node cleanups here if we think there is a chance the leak detection is simply not working (I think it's a different mechanism to fip cleanup though so it's probably fine)15:51
corvuswe stopped doing that a long time ago with nodepool15:51
clarkbcorvus: ok so nodepool was doing pure leaked node cleanup like this change does in the event of a first failure?15:52
corvusyep15:52
clarkback in that case I'll assume it works and +2 the change. I do think fip cleanup happens via node listing (there is a comment in the code about this) so it's in a different portion of the control flow too15:53
corvusi don't know that the leaked resource cleanup is broken (aside from the report on that fip, but as you say, that's different.  i haven't looked into it).15:53
corvusremote:   https://review.opendev.org/c/zuul/zuul/+/956263 Openstack: don't log tracebacks for quota exceptions [NEW]        15:57
corvushopefully those all merge before the disk fills up again :)15:59
corvusbtw i have restarted both zuul-web instances, so the "nodes" page should work now (after a reload)16:00
fungiclarkb: not introduced by your change, but what does the gerrit logger do for zuul launchers? what logs as "gerrit" on those servers?16:01
clarkbfungi: I'm not sure, but zuul does have valid gerrit logging in it elsewhere so I didn't want to remove it as it was already there16:01
clarkbI think the schedulers are where you're actually going to encounter gerrit logs. The other components probably don't run that code at all16:02
corvusyep, we could probably drop it16:04
corvusthe launchers appear to be handling more requests now16:09
fungiclarkb: next hurdle for the eavesdrop replacement... looks like container names changed with the switch to podman maybe? or am i misreading https://zuul.opendev.org/t/openstack/build/bd9a6042d82c4983a103ed0453befd9b 16:10
clarkbfungi: oh yes, I got all of that working in a separate change earlier I think you just need to squash it in. One sec while I dig that up16:11
clarkbfungi: https://review.opendev.org/c/opendev/system-config/+/955520 I think you can just copy that into your change and I can abandon my change16:11
fungiaha, can do. thanks! this will make it a lot easier16:13
dpanechHi, some of our reviews have been stuck for hours, eg: https://zuul.opendev.org/t/openstack/status?change=956113%2C4 16:16
fungidpanech: between ~13:15-15:30 utc we weren't booting new test nodes due to full filesystems on the launchers, but things should be running again now16:18
fungihttps://grafana.opendev.org/d/21a6e53ea4/zuul-status indicates there's no backlog of node requests now16:19
clarkbalso if you expand the job list and hover your mouse pointer on the status blob that says queued it will tell you where in the process the build is16:19
clarkbsays it is waiting on a node request and then you can go look at grafana and see are all the nodes being used, is something wrong, etc16:19
clarkband it just started16:19
fungiyeah, looks like it took about an hour for the launchers to catch up on the pending backlog of requests after they got cleaned up16:21
opendevreviewJeremy Stanley proposed opendev/system-config master: Replace eavesdrop01 with eavesdrop02  https://review.opendev.org/c/opendev/system-config/+/956122 16:25
fungii need to skip out shortly to an appointment, may be a couple of hours. will probably also disappear later for dinner, but am planning to focus on making sure the eavesdrop replacement change is ready for tomorrow16:26
dpanechclarkb: fungi: they merged, thanks16:28
clarkbfungi: I suspect the change is close if not ready now16:31
fungiyeah, if system-config-run-eavesdrop works now, which i think it will finally this time, the rest should already be passing16:35
opendevreviewJeremy Stanley proposed opendev/system-config master: Link VEXXHOST donor logo to testimonial article  https://review.opendev.org/c/opendev/system-config/+/956268 16:53
fungi^ hot off the presses16:54
fungiyay! system-config-run-eavesdrop is now passing on 956122, so should be all set for tomorrow17:11
clarkblgtm17:12
fungiokay, headed out to appointment and then food, will be a few hours probably17:31
corvusremote:   https://review.opendev.org/c/zuul/zuul/+/956289 Add unit tests (and fixes) for selectProviders        18:40
corvusthat will make me feel a bit better about those ready node changes18:41
clarkbcorvus: I posted some questions to that change19:04
opendevreviewMerged opendev/system-config master: Don't propagate zuul-launcher logs written to disk  https://review.opendev.org/c/opendev/system-config/+/956258 19:18
clarkbcorvus: ^ I think that requires a restart of the service to pick up but we can probably do that when 956289 lands?19:53
