corvus | the web ui needs an update to deal with that: https://review.opendev.org/956205 | 00:00 |
---|---|---|
corvus | something strange happened with some requests after restarting with the ready node change. some requests selected sjc3 even though they couldn't use the ready nodes and sjc didn't have any excess capacity. i don't understand why that would have happened. | 00:52 |
corvus | to address the immediate issue, i wrote https://review.opendev.org/956206 and then to add a bit more debugging to try to figure out if there's something more fundamental wrong i wrote https://review.opendev.org/956207 | 00:52 |
corvus | because i think that requests like that could get permanently stuck (at least until the ready nodes are cleared out), i have live-patched an approximation of those changes into the launchers. we'll definitely want to get those (or something like it) merged soon. | 00:54 |
corvus | that does appear to have resolved the problem from what i can see | 00:54 |
opendevreview | Simon Westphahl proposed zuul/zuul-jobs master: Add default for build diskimage image name https://review.opendev.org/c/zuul/zuul-jobs/+/956219 | 08:44 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Replace eavesdrop01 with eavesdrop02 https://review.opendev.org/c/opendev/system-config/+/956122 | 14:22 |
fungi | corvus: looking at https://grafana.opendev.org/d/21a6e53ea4/zuul-status it seems like the launchers should be booting more nodes but aren't. i'm guessing we have some hidden quota pressure not apparent from statsd, but i'm going to check for errors in the debug logs first | 15:11 |
corvus | i think they filled their disk | 15:12 |
fungi | d'oh! | 15:12 |
fungi | yes, rootfs at 100% | 15:12 |
corvus | i'll clean that out | 15:12 |
fungi | 8.5gb of zuul logs on zl01 | 15:13 |
corvus | that's after i deleted logs | 15:13 |
fungi | oh wow | 15:13 |
fungi | syslog is not tiny either | 15:14 |
fungi | 24gb in total for /var/log | 15:15 |
fungi | on a 38gb rootfs | 15:15 |
fungi | yeah, syslog is full of zuul-launcher tracebacks | 15:16 |
fungi | one random example was a quota exceeded for rackspace flex sjc3 | 15:18 |
fungi | looks like that's raising a ForbiddenException in the openstack sdk so the full traceback gets logged | 15:19 |
corvus | we don't need that going to syslog. any idea why it does? | 15:19 |
fungi | podman configuration? | 15:19 |
fungi | though i'm not immediately seeing it set in the compose file | 15:19 |
clarkb | we don't have an explicit journald log config in the docker compose file | 15:20 |
corvus | if you or someone else feel like looking into improving that, that would be great | 15:20 |
clarkb | but ya maybe podman by default emits to syslog/journald? | 15:20 |
fungi | yeah, i think rsyslogd is copying entries to there | 15:20 |
corvus | i'm going to delete some logs then restart the containers on the image build for change 956207 | 15:20 |
fungi | thanks corvus! | 15:20 |
fungi | clarkb: yeah, the journal is relatively huge too, so it's likely copying to both | 15:21 |
clarkb | the easiest solution may be to update the zuul logging config to stop writing to stdout/stderr | 15:21 |
fungi | so odds are we have 3x copies of everything the launchers are logging | 15:21 |
clarkb | since that would be where podman collects it to forward to journald/syslog | 15:21 |
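(If the container runtime's log forwarding turns out to be the culprit, one way to cap it is the compose file's logging options. This is a hypothetical fragment, not the actual opendev compose config; the service name and limits are illustrative:)

```yaml
# Sketch: cap container log output with a size-limited json-file
# driver instead of letting everything flow to journald/syslog.
services:
  zuul-launcher:
    logging:
      driver: "json-file"
      options:
        max-size: "100m"   # rotate after 100MB
        max-file: "3"      # keep at most 3 rotated files
```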
corvus | oh is it only tracebacks showing up in syslog? | 15:22 |
fungi | i think so | 15:23 |
corvus | oh i think it may be any error message | 15:23 |
fungi | and messages about rsyslogd being overwhelmed/unable to write | 15:23 |
clarkb | logger_root writes to stdout. The explicit child loggers write to the log files (debug and non debug) | 15:24 |
clarkb | but we're also propagating the logs up to the root so everything goes to stdout | 15:24 |
corvus | maybe we can stop propagation of any messages we are explicitly handling, and still allow unhandled errors to go to stdout? | 15:24 |
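(What corvus describes maps to the `propagate` flag on Python loggers: explicitly handled loggers keep their own handlers and stop forwarding to the root, while anything unhandled still reaches the root/stdout handler. A minimal sketch; the logger names `zuul` and `somelib` are illustrative:)

```python
import io
import logging

# The root logger stands in for the container's stdout handler.
root_stream = io.StringIO()
logging.basicConfig(stream=root_stream, level=logging.DEBUG, force=True)

# An explicitly handled child logger writing to its own debug "file"
# (a StringIO stands in for the log file here).
file_stream = io.StringIO()
child = logging.getLogger("zuul")
child.setLevel(logging.DEBUG)
child.addHandler(logging.StreamHandler(file_stream))
child.propagate = False  # stop copying these records to the root handler

child.error("quota exceeded")  # lands only in the child's handler
logging.getLogger("somelib").error("unexpected failure")  # still hits root
```

With `propagate = False`, the handled logger's records no longer double up on stdout, but unconfigured loggers keep bubbling to the root as before.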
fungi | i'm not sure if there's more errors aside from tracebacks, i was looking at a slice of the logs from back before the disk filled up and tracebacks were what i got | 15:24 |
clarkb | corvus: that seems reasonable I can work on a change for that | 15:24 |
fungi | could easily be other output landing there too | 15:24 |
corvus | yeah if they're coming from stdout, then i think it's any error message | 15:25 |
corvus | 2025-07-31 15:26:32,565 ERROR zuul.Launcher: openstack.exceptions.HttpException: HttpException: 500: Server Error for url: https://compute.public.mtl1.vexxhost.net/v2.1/86bbbcfa8ad043109d2d7af530225c72/servers/a2e228ac-0bbd-4295-aa4a-fa849766b144, Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible. | 15:27 |
clarkb | corvus: fungi actually there are only zuul, gerrit, and gear loggers. I think we may also need to capture and not propagate openstack? | 15:27 |
clarkb | let me see if I can find the old nodepool logging config | 15:27 |
corvus | clarkb: i'm not seeing openstack.* logs on stdout | 15:28 |
clarkb | corvus: ok maybe we're reraising under zuul so everything is qualified under zuul? what about requests and kazoo? those are the other two nodepool had | 15:29 |
corvus | haven't seen those yet. i did see paramiko though | 15:30 |
clarkb | ack. Do you think we should add them defensively or wait for it to be a problem first? | 15:30 |
corvus | let's only add what we know we need so we don't copy nodepool cruft | 15:31 |
corvus | that vexxhost error may be a big part of this: that's being raised inside of a threadpoolexecutor, and i think something about how we're handling the error is causing an infinitely-growing traceback every time it hits an exception handler | 15:31 |
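(The pattern corvus describes is reproducible in isolation: re-raising the same exception object keeps its old traceback and appends a new frame each time, so a cached error that repeatedly passes through handlers grows without bound. A minimal demonstration, not the actual launcher code:)

```python
import traceback

exc = ValueError("boom")  # stands in for a cached cloud API error
depths = []
for _ in range(3):
    try:
        raise exc  # re-raise keeps the existing traceback and adds a frame
    except ValueError as caught:
        depths.append(len(traceback.extract_tb(caught.__traceback__)))

# depths grows by one frame per re-raise
```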
corvus | now that i've restarted the launchers, i'm going to try to figure out how to handle that better | 15:32 |
fungi | yeah, the log entries i saw in random samples were all for zuul.Launcher | 15:34 |
opendevreview | Clark Boylan proposed opendev/system-config master: Don't propagate zuul-launcher logs written to disk https://review.opendev.org/c/opendev/system-config/+/956258 | 15:36 |
clarkb | we may want to do similar with other zuul services but this seems like a decent starting point. I wasn't sure what level to log paramiko at so I did warning | 15:36 |
fungi | i'm checking to see if i find anything else ending up in syslog, there's definitely stuff from paramiko.transport | 15:37 |
clarkb | fungi: what log levels are the paramiko.transport logs? and if lower than warning do you think we need to capture the messages? | 15:38 |
fungi | error | 15:38 |
fungi | so far | 15:38 |
clarkb | ok so warning is probably appropriate | 15:38 |
fungi | though i do see zuul.Launcher warnings landing in syslog too, not just error | 15:38 |
clarkb | fungi: zuul is being captured at a debug level | 15:38 |
fungi | WARNING keystoneauth.session: Failure: Unable to establish connection to ... | 15:39 |
clarkb | oh I see the console (stdout) logger is warning and above | 15:40 |
clarkb | so ya I think warning is safe to capture for everything that was propagating | 15:41 |
fungi | that's all i'm finding since the cleanup and restart | 15:41 |
clarkb | let me add keystoneauth to this change | 15:41 |
opendevreview | Clark Boylan proposed opendev/system-config master: Don't propagate zuul-launcher logs written to disk https://review.opendev.org/c/opendev/system-config/+/956258 | 15:42 |
fungi | prior to the cleanup, it's hard to mine things from the logs due to tracebacks being multi-line and not having any obvious prefix to filter on | 15:43 |
clarkb | fungi: the tracebacks would include the python file paths? | 15:46 |
fungi | and lines of code, and separate lines of visual underlining | 15:47 |
corvus | remote: https://review.opendev.org/c/zuul/zuul/+/956260 Launcher: handle delete state machine errors [NEW] | 15:47 |
fungi | each step in the trace has a line for the file, a line for the code, and a line for the underlining | 15:47 |
clarkb | fungi: ya so you'd do something like grep /path/to/code -A 2 or something | 15:48 |
fungi | except i was trying to filter them out, not include them | 15:48 |
fungi | so i want to drop the subsequent 2 lines whenever the line matches | 15:49 |
fungi | there's probably a grep option for that, just need to revisit the manpage | 15:49 |
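(grep has no "exclude with trailing context" option, but GNU sed's `addr,+N` addressing drops a matching line plus the N lines after it, which fits the three-line traceback steps fungi describes. The sample input and pattern are illustrative:)

```shell
# Drop each traceback step: the "File ..." line plus the two lines
# that follow it (the code line and the underline). GNU sed only.
printf 'keep1\n  File "x.py", line 1\n    code\n    ^^^^\nkeep2\n' \
  | sed '/^  File /,+2d'
```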
clarkb | corvus: re 956260 does that match nodepool's cleanup approach to failures to delete? I know that if things failed to delete at some point we'd create a new znode to represent it then try again. Was that in the cleanup phase and not the initial delete? | 15:50 |
clarkb | corvus: mostly wondering because I think the floating IP cleanup is still not working for that one floating ip in sjc3 and I don't want to rely on node cleanups here if we think there is a chance the leak detection is simply not working (I think it's a different mechanism to fip cleanup though so it's probably fine) | 15:51 |
corvus | we stopped doing that a long time ago with nodepool | 15:51 |
clarkb | corvus: ok so nodepool was doing pure leaked node cleanup like this change does in the event of a first failure? | 15:52 |
corvus | yep | 15:52 |
clarkb | ack in that case I'll assume it works and +2 the change. I do think fip cleanup happens via node listing (there is a comment in the code about this) so it's in a different portion of the control flow too | 15:53 |
corvus | i don't know that the leaked resource cleanup is broken (aside from the report on that fip, but as you say, that's different. i haven't looked into it). | 15:53 |
corvus | remote: https://review.opendev.org/c/zuul/zuul/+/956263 Openstack: don't log tracebacks for quota exceptions [NEW] | 15:57 |
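(The idea behind a change like 956263 — logging quota failures as one-line errors instead of full tracebacks — comes down to using `log.error` rather than `log.exception` in the handler. A sketch under that assumption; the exception type stands in for the SDK's real one and this is not the actual change:)

```python
import io
import logging

buf = io.StringIO()
logging.basicConfig(stream=buf, level=logging.INFO, force=True)
log = logging.getLogger("zuul.Launcher")

try:
    # Stands in for the openstack sdk's quota/forbidden error.
    raise RuntimeError("Quota exceeded for instances")
except RuntimeError as e:
    log.error("Launch failed: %s", e)  # one line, no traceback
    # log.exception("Launch failed") would append the full traceback
```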
corvus | hopefully those all merge before the disk fills up again :) | 15:59 |
corvus | btw i have restarted both zuul-web instances, so the "nodes" page should work now (after a reload) | 16:00 |
fungi | clarkb: not introduced by your change, but what does the gerrit logger do for zuul launchers? what logs as "gerrit" on those servers? | 16:01 |
clarkb | fungi: I'm not sure, but zuul does have valid gerrit logging in it elsewhere so I didn't want to remove it as it was already there | 16:01 |
clarkb | I think the schedulers are where you're actually going to encounter gerrit logs. The other components probably don't run that code at all | 16:02 |
corvus | yep, we could probably drop it | 16:04 |
corvus | the launchers appear to be handling more requests now | 16:09 |
fungi | clarkb: next hurdle for the eavesdrop replacement... looks like container names changed with the switch to podman maybe? or am i misreading https://zuul.opendev.org/t/openstack/build/bd9a6042d82c4983a103ed0453befd9b | 16:10 |
clarkb | fungi: oh yes, I got all of that working in a separate change earlier I think you just need to squash it in. One sec while I dig that up | 16:11 |
clarkb | fungi: https://review.opendev.org/c/opendev/system-config/+/955520 I think you can just copy that into your change and I can abandon my change | 16:11 |
fungi | aha, can do. thanks! this will make it a lot easier | 16:13 |
dpanech | Hi, some of our reviews have been stuck for hours, eg: https://zuul.opendev.org/t/openstack/status?change=956113%2C4 | 16:16 |
fungi | dpanech: between ~13:15-15:30 utc we weren't booting new test nodes due to full filesystems on the launchers, but things should be running again now | 16:18 |
fungi | https://grafana.opendev.org/d/21a6e53ea4/zuul-status indicates there's no backlog of node requests now | 16:19 |
clarkb | also if you expand the job list and hover your mouse pointer over the status blob that says queued it will tell you where in the process the build is | 16:19 |
clarkb | says it is waiting on a node request and then you can go look at grafana and see whether all the nodes are being used, whether something is wrong, etc | 16:19 |
clarkb | and it just started | 16:19 |
fungi | yeah, looks like it took about an hour for the launchers to catch up on the pending backlog of requests after they got cleaned up | 16:21 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Replace eavesdrop01 with eavesdrop02 https://review.opendev.org/c/opendev/system-config/+/956122 | 16:25 |
fungi | i need to skip out shortly to an appointment, may be a couple of hours. will probably also disappear later for dinner, but am planning to focus on making sure the eavesdrop replacement change is ready for tomorrow | 16:26 |
dpanech | clarkb: fungi: they merged, thanks | 16:28 |
clarkb | fungi: I suspect the change is close if not ready now | 16:31 |
fungi | yeah, if system-config-run-eavesdrop works now, which i think it will finally this time, the rest should already be passing | 16:35 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Link VEXXHOST donor logo to testimonial article https://review.opendev.org/c/opendev/system-config/+/956268 | 16:53 |
fungi | ^ hot off the presses | 16:54 |
fungi | yay! system-config-run-eavesdrop is now passing on 956122, so should be all set for tomorrow | 17:11 |
clarkb | lgtm | 17:12 |
fungi | okay, headed out to appointment and then food, will be a few hours probably | 17:31 |
corvus | remote: https://review.opendev.org/c/zuul/zuul/+/956289 Add unit tests (and fixes) for selectProviders | 18:40 |
corvus | that will make me feel a bit better about those ready node changes | 18:41 |
clarkb | corvus: I posted some questions to that change | 19:04 |
opendevreview | Merged opendev/system-config master: Don't propagate zuul-launcher logs written to disk https://review.opendev.org/c/opendev/system-config/+/956258 | 19:18 |
clarkb | corvus: ^ I think that requires a restart of the service to pick up but we can probably do that when 956289 lands? | 19:53 |
Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!