corvus | the web ui needs an update to deal with that: https://review.opendev.org/956205 | 00:00 |
---|---|---|
corvus | something strange happened with some requests after restarting with the ready node change. some requests selected sjc3 even though they couldn't use the ready nodes and sjc didn't have any excess capacity. i don't understand why that would have happened. | 00:52 |
corvus | to address the immediate issue, i wrote https://review.opendev.org/956206 and then to add a bit more debugging to try to figure out if there's something more fundamental wrong i wrote https://review.opendev.org/956207 | 00:52 |
corvus | because i think that requests like that could get permanently stuck (at least until the ready nodes are cleared out), i have live-patched an approximation of those changes into the launchers. we'll definitely want to get those (or something like it) merged soon. | 00:54 |
corvus | that does appear to have resolved the problem from what i can see | 00:54 |
opendevreview | Simon Westphahl proposed zuul/zuul-jobs master: Add default for build diskimage image name https://review.opendev.org/c/zuul/zuul-jobs/+/956219 | 08:44 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Replace eavesdrop01 with eavesdrop02 https://review.opendev.org/c/opendev/system-config/+/956122 | 14:22 |
fungi | corvus: looking at https://grafana.opendev.org/d/21a6e53ea4/zuul-status it seems like the launchers should be booting more nodes but aren't. i'm guessing we have some hidden quota pressure not apparent from statsd, but i'm going to check for errors in the debug logs first | 15:11 |
corvus | i think they filled their disk | 15:12 |
fungi | d'oh! | 15:12 |
fungi | yes, rootfs at 100% | 15:12 |
corvus | i'll clean that out | 15:12 |
fungi | 8.5gb of zuul logs on zl01 | 15:13 |
corvus | that's after i deleted logs | 15:13 |
fungi | oh wow | 15:13 |
fungi | syslog is not tiny either | 15:14 |
fungi | 24gb in total for /var/log | 15:15 |
fungi | on a 38gb rootfs | 15:15 |
fungi | yeah, syslog is full of zuul-launcher tracebacks | 15:16 |
fungi | one random example was a quota exceeded for rackspace flex sjc3 | 15:18 |
fungi | looks like that's raising a ForbiddenException in the openstack sdk so the full traceback gets logged | 15:19 |
corvus | we don't need that going to syslog. any idea why it does? | 15:19 |
fungi | podman configuration? | 15:19 |
fungi | though i'm not immediately seeing it set in the compose file | 15:19 |
clarkb | we don't have an explicit journald log config in the docker compose file | 15:20 |
corvus | if you or someone else feel like looking into improving that, that would be great | 15:20 |
clarkb | but ya maybe podman by default emits to syslog/journald? | 15:20 |
fungi | yeah, i think rsyslogd is copying entries to there | 15:20 |
corvus | i'm going to delete some logs then restart the containers on the image build for change 956207 | 15:20 |
fungi | thanks corvus! | 15:20 |
fungi | clarkb: yeah, the journal is relatively huge too, so it's likely copying to both | 15:21 |
clarkb | the easiest solution may be to update the zuul logging config to stop writing to stdout/stderr | 15:21 |
fungi | so odds are we have 3x copies of everything the launchers are logging | 15:21 |
clarkb | since that would be where podman collects it to forward to journald/syslog | 15:21 |
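(If the container runtime's log forwarding turns out to be the culprit, one way to cap it is the compose file's logging options. This is a hypothetical fragment, not the actual opendev compose config; the service name and limits are illustrative:)

```yaml
# Sketch: cap container log output with a size-limited json-file
# driver instead of letting everything flow to journald/syslog.
services:
  zuul-launcher:
    logging:
      driver: "json-file"
      options:
        max-size: "100m"   # rotate after 100MB
        max-file: "3"      # keep at most 3 rotated files
```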
corvus | oh is it only tracebacks showing up in syslog? | 15:22 |
fungi | i think so | 15:23 |
corvus | oh i think it may be any error message | 15:23 |
fungi | and messages about rsyslogd being overwhelmed/unable to write | 15:23 |
clarkb | logger_root writes to stdout. The explicit child loggers write to the log files (debug and non debug) | 15:24 |
clarkb | but we're also propagating the logs up to the root so everything goes to stdout | 15:24 |
corvus | maybe we can stop propagation of any messages we are explicitly handling, and still allow unhandled errors to go to stdout? | 15:24 |
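(What corvus describes maps to the `propagate` flag on Python loggers: explicitly handled loggers keep their own handlers and stop forwarding to the root, while anything unhandled still reaches the root/stdout handler. A minimal sketch; the logger names `zuul` and `somelib` are illustrative:)

```python
import io
import logging

# The root logger stands in for the container's stdout handler.
root_stream = io.StringIO()
logging.basicConfig(stream=root_stream, level=logging.DEBUG, force=True)

# An explicitly handled child logger writing to its own debug "file"
# (a StringIO stands in for the log file here).
file_stream = io.StringIO()
child = logging.getLogger("zuul")
child.setLevel(logging.DEBUG)
child.addHandler(logging.StreamHandler(file_stream))
child.propagate = False  # stop copying these records to the root handler

child.error("quota exceeded")  # lands only in the child's handler
logging.getLogger("somelib").error("unexpected failure")  # still hits root
```

With `propagate = False`, the handled logger's records no longer double up on stdout, but unconfigured loggers keep bubbling to the root as before.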
fungi | i'm not sure if there's more errors aside from tracebacks, i was looking at a slice of the logs from back before the disk filled up and tracebacks were what i got | 15:24 |
clarkb | corvus: that seems reasonable I can work on a change for that | 15:24 |
fungi | could easily be other output landing there too | 15:24 |
corvus | yeah if they're coming from stdout, then i think it's any error message | 15:25 |
corvus | 2025-07-31 15:26:32,565 ERROR zuul.Launcher: openstack.exceptions.HttpException: HttpException: 500: Server Error for url: https://compute.public.mtl1.vexxhost.net/v2.1/86bbbcfa8ad043109d2d7af530225c72/servers/a2e228ac-0bbd-4295-aa4a-fa849766b144, Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible. | 15:27 |
clarkb | corvus: fungi actually there are only zuul, gerrit, and gear loggers. I think we may also need to capture and not propagate openstack? | 15:27 |
clarkb | let me see if I can find the old nodepool logging config | 15:27 |
corvus | clarkb: i'm not seeing openstack.* logs on stdout | 15:28 |
clarkb | corvus: ok maybe we're reraising under zuul so everything is qualified under zuul? what about requests and kazoo? those are the other two nodepool had | 15:29 |
corvus | haven't seen those yet. i did see paramiko though | 15:30 |
clarkb | ack. Do you think we should add them defensively or wait for it to be a problem first? | 15:30 |
corvus | let's only add what we know we need so we don't copy nodepool cruft | 15:31 |
corvus | that vexxhost error may be a big part of this: that's being raised inside of a threadpoolexecutor, and i think something about how we're handling the error is causing an infinitely-growing traceback every time it hits an exception handler | 15:31 |
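(The pattern corvus describes is reproducible in isolation: re-raising the same exception object keeps its old traceback and appends a new frame each time, so a cached error that repeatedly passes through handlers grows without bound. A minimal demonstration, not the actual launcher code:)

```python
import traceback

exc = ValueError("boom")  # stands in for a cached cloud API error
depths = []
for _ in range(3):
    try:
        raise exc  # re-raise keeps the existing traceback and adds a frame
    except ValueError as caught:
        depths.append(len(traceback.extract_tb(caught.__traceback__)))

# depths grows by one frame per re-raise
```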
corvus | now that i've restarted the launchers, i'm going to try to figure out how to handle that better | 15:32 |
fungi | yeah, the log entries i saw in random samples were all for zuul.Launcher | 15:34 |
opendevreview | Clark Boylan proposed opendev/system-config master: Don't propagate zuul-launcher logs written to disk https://review.opendev.org/c/opendev/system-config/+/956258 | 15:36 |
clarkb | we may want to do similar with other zuul services but this seems like a decent starting point. I wasn't sure what level to log paramiko at so I did warning | 15:36 |
fungi | i'm checking to see if i find anything else ending up in syslog, there's definitely stuff from paramiko.transport | 15:37 |
clarkb | fungi: what log levels are the paramiko.transport logs? and if lower than warning do you think we need to capture the messages? | 15:38 |
fungi | error | 15:38 |
fungi | so far | 15:38 |
clarkb | ok so warning is probably appropriate | 15:38 |
fungi | though i do see zuul.Launcher warnings landing in syslog too, not just error | 15:38 |
clarkb | fungi: zuul is being captured at a debug level | 15:38 |
fungi | WARNING keystoneauth.session: Failure: Unable to establish connection to ... | 15:39 |
clarkb | oh I see the console (stdout) logger is warning and above | 15:40 |
clarkb | so ya I think warning is safe to capture for everything that was propagating | 15:41 |
fungi | that's all i'm finding since the cleanup and restart | 15:41 |
clarkb | let me add keystoneauth to this change | 15:41 |
opendevreview | Clark Boylan proposed opendev/system-config master: Don't propagate zuul-launcher logs written to disk https://review.opendev.org/c/opendev/system-config/+/956258 | 15:42 |
fungi | prior to the cleanup, it's hard to mine things from the logs due to tracebacks being multi-line and not having any obvious prefix to filter on | 15:43 |
clarkb | fungi: the tracebacks would include the python file paths? | 15:46 |
fungi | and lines of code, and separate lines of visual underlining | 15:47 |
corvus | remote: https://review.opendev.org/c/zuul/zuul/+/956260 Launcher: handle delete state machine errors [NEW] | 15:47 |
fungi | each step in the trace has a line for the file, a line for the code, and a line for the underlining | 15:47 |
clarkb | fungi: ya so you'd do something like grep /path/to/code -A 2 or something | 15:48 |
fungi | except i was trying to filter them out, not include them | 15:48 |
fungi | so i want to drop the subsequent 2 lines whenever the line matches | 15:49 |
fungi | there's probably a grep option for that, just need to revisit the manpage | 15:49 |
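(grep has no "exclude with trailing context" option, but GNU sed's `addr,+N` addressing drops a matching line plus the N lines after it, which fits the three-line traceback steps fungi describes. The sample input and pattern are illustrative:)

```shell
# Drop each traceback step: the "File ..." line plus the two lines
# that follow it (the code line and the underline). GNU sed only.
printf 'keep1\n  File "x.py", line 1\n    code\n    ^^^^\nkeep2\n' \
  | sed '/^  File /,+2d'
```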
clarkb | corvus: re 956260 does that match nodepool's cleanup approach to failures to delete? I know that if things failed to delete at some point we'd create a new znode to represent it then try again. Was that in the cleanup phase and not the initial delete? | 15:50 |
clarkb | corvus: mostly wondering because I think the floating IP cleanup is still not working for that one floating ip in sjc3 and I don't want to rely on node cleanups here if we think there is a chance the leak detection is simply not working (I think it's a different mechanism to fip cleanup though so it's probably fine) | 15:51 |
corvus | we stopped doing that a long time ago with nodepool | 15:51 |
clarkb | corvus: ok so nodepool was doing pure leaked node cleanup like this change does in the event of a first failure? | 15:52 |
corvus | yep | 15:52 |
clarkb | ack in that case I'll assume it works and +2 the change. I do think fip cleanup happens via node listing (there is a comment in the code about this) so it's in a different portion of the control flow too | 15:53 |
corvus | i don't know that the leaked resource cleanup is broken (aside from the report on that fip, but as you say, that's different. i haven't looked into it). | 15:53 |
corvus | remote: https://review.opendev.org/c/zuul/zuul/+/956263 Openstack: don't log tracebacks for quota exceptions [NEW] | 15:57 |
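(The idea behind a change like 956263 — logging quota failures as one-line errors instead of full tracebacks — comes down to using `log.error` rather than `log.exception` in the handler. A sketch under that assumption; the exception type stands in for the SDK's real one and this is not the actual change:)

```python
import io
import logging

buf = io.StringIO()
logging.basicConfig(stream=buf, level=logging.INFO, force=True)
log = logging.getLogger("zuul.Launcher")

try:
    # Stands in for the openstack sdk's quota/forbidden error.
    raise RuntimeError("Quota exceeded for instances")
except RuntimeError as e:
    log.error("Launch failed: %s", e)  # one line, no traceback
    # log.exception("Launch failed") would append the full traceback
```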
corvus | hopefully those all merge before the disk fills up again :) | 15:59 |
corvus | btw i have restarted both zuul-web instances, so the "nodes" page should work now (after a reload) | 16:00 |
fungi | clarkb: not introduced by your change, but what does the gerrit logger do for zuul launchers? what logs as "gerrit" on those servers? | 16:01 |
clarkb | fungi: I'm not sure, but zuul does have valid gerrit logging in it elsewhere so I didn't want to remove it as it was already there | 16:01 |
clarkb | I think the schedulers are where you're actually going to encounter gerrit logs. The other components probably don't run that code at all | 16:02 |
corvus | yep, we could probably drop it | 16:04 |
corvus | the launchers appear to be handling more requests now | 16:09 |
fungi | clarkb: next hurdle for the eavesdrop replacement... looks like container names changed with the switch to podman maybe? or am i misreading https://zuul.opendev.org/t/openstack/build/bd9a6042d82c4983a103ed0453befd9b | 16:10 |
clarkb | fungi: oh yes, I got all of that working in a separate change earlier I think you just need to squash it in. One sec while I dig that up | 16:11 |
clarkb | fungi: https://review.opendev.org/c/opendev/system-config/+/955520 I think you can just copy that into your change and I can abandon my change | 16:11 |
fungi | aha, can do. thanks! this will make it a lot easier | 16:13 |
dpanech | Hi, some of our reviews have been stuck for hours, eg: https://zuul.opendev.org/t/openstack/status?change=956113%2C4 | 16:16 |
fungi | dpanech: between ~13:15-15:30 utc we weren't booting new test nodes due to full filesystems on the launchers, but things should be running again now | 16:18 |
fungi | https://grafana.opendev.org/d/21a6e53ea4/zuul-status indicates there's no backlog of node requests now | 16:19 |
clarkb | also if you expand the job list and hover your mouse pointer over the status blob that says queued it will tell you where in the process the build is | 16:19 |
clarkb | says it is waiting on a node request and then you can go look at grafana and see whether all the nodes are being used, whether something is wrong, etc | 16:19 |
clarkb | and it just started | 16:19 |
fungi | yeah, looks like it took about an hour for the launchers to catch up on the pending backlog of requests after they got cleaned up | 16:21 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Replace eavesdrop01 with eavesdrop02 https://review.opendev.org/c/opendev/system-config/+/956122 | 16:25 |
fungi | i need to skip out shortly to an appointment, may be a couple of hours. will probably also disappear later for dinner, but am planning to focus on making sure the eavesdrop replacement change is ready for tomorrow | 16:26 |
dpanech | clarkb: fungi: they merged, thanks | 16:28 |
clarkb | fungi: I suspect the change is close if not ready now | 16:31 |
fungi | yeah, if system-config-run-eavesdrop works now, which i think it will finally this time, the rest should already be passing | 16:35 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Link VEXXHOST donor logo to testimonial article https://review.opendev.org/c/opendev/system-config/+/956268 | 16:53 |
fungi | ^ hot off the presses | 16:54 |
fungi | yay! system-config-run-eavesdrop is now passing on 956122, so should be all set for tomorrow | 17:11 |
clarkb | lgtm | 17:12 |
fungi | okay, headed out to appointment and then food, will be a few hours probably | 17:31 |
corvus | remote: https://review.opendev.org/c/zuul/zuul/+/956289 Add unit tests (and fixes) for selectProviders | 18:40 |
corvus | that will make me feel a bit better about those ready node changes | 18:41 |
clarkb | corvus: I posted some questions to that change | 19:04 |
opendevreview | Merged opendev/system-config master: Don't propagate zuul-launcher logs written to disk https://review.opendev.org/c/opendev/system-config/+/956258 | 19:18 |
clarkb | corvus: ^ I think that requires a restart of the service to pick up but we can probably do that when 956289 lands? | 19:53 |
Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!