*** elodilles is now known as elodilles_pto | 06:15 | |
*** cloudnull1097746 is now known as cloudnull109774 | 07:21 | |
noonedeadpunk | fwiw gitea13 seems to be quite slow today... | 10:11 |
opendevreview | Merged openstack/project-config master: [release-tools] Improve tagging logs https://review.opendev.org/c/openstack/project-config/+/954340 | 10:42 |
opendevreview | Merged openstack/project-config master: Remove nodepool config jobs https://review.opendev.org/c/openstack/project-config/+/955235 | 10:47 |
stephenfin | fungi: clarkb: all comments on the pbr series are now addressed, afaict | 12:07 |
ykarel | corvus, frickler: still seeing that multi-provider multinode job issue in periodic jobs; is there a fix that hasn't merged yet? | 12:33 |
ykarel | it improved, but it's still seen; today's example: https://zuul.openstack.org/build/fd909b7265c044a786e39c66e57984f7 | 12:33 |
ykarel | mixed nodes from rax-ord-main and ovh-gra1-main | 12:34 |
fungi | apparently my shell server fell offline/rebooted at utc midnight. i'll check the bot-recorded logs for important channels, but ping me if there's something that needs my urgent attention since i may have missed it | 13:20 |
frickler | fungi: ykarel reported another "distributed multinode" issue just earlier, maybe not urgent, but still worth looking into? https://zuul.openstack.org/build/fd909b7265c044a786e39c66e57984f7 | 13:39 |
fungi | thanks, looks like that was a periodic job too, so probably still us overwhelming one or more of our providers | 13:45 |
fungi | keep in mind that booting multi-node jobs across providers is possible at all because it's a new feature of zuul; without that fallback attempt the job would have ended in a node_failure instead | 13:47 |
fungi | so short of making it possible to turn that feature off and going back to node_failure results for those cases, we can continue trying to fine-tune our use of various providers in order to avoid overloading them | 13:48 |
corvus | yep -- adding a feature to turn that into a node_failure is definitely on the table; i just think this is really helpful for finding issues and improving things, so i don't want to do it until we've fixed all the zuul bugs at least. :) | 14:03 |
corvus | ykarel: thanks for the report, i'll look into it in a bit | 14:03 |
ykarel | thx frickler fungi corvus | 14:18 |
corvus | ykarel: frickler fungi it looks like there is indeed an opportunity for improvement here: https://review.opendev.org/c/zuul/zuul/+/955292 Handle another form of openstack quota error message | 14:25 |
fungi | thanks! | 14:48 |
fungi | heading to lunch, back in an hour | 14:53 |
Matko[m] | Am I the only one having issues reaching https://releases.openstack.org/ ? | 15:47 |
corvus | it does seem slow to me | 15:54 |
corvus | the server is up and lightly loaded | 15:55 |
fungi | load average: 0.06, 0.06, 0.06 | 15:56 |
fungi | doesn't seem that loaded from a system level | 15:57 |
fungi | yeah, agreed | 15:57 |
corvus | i'm not seeing packet loss from my workstation | 15:57 |
fungi | could dns resolution be slow? | 15:58 |
Matko[m] | I tried to reach it from 3 different sources and also asked a colleague to test. All tests show issues getting a response from the server. ping works fine, resolution is fine too. I get the IP but the connection to 443/tcp is slow. | 15:58 |
fungi | interesting, i had it up in my browser before i went to lunch, and when i reloaded the page just now i got "This site can’t be reached: releases.openstack.org took too long to respond." | 15:58 |
corvus | dns responds quickly here and looks good | 15:59 |
Matko[m] | curl is stuck at "(304) (OUT), TLS handshake, Client hello (1)" | 15:59 |
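For reference, a rough sketch of the client-side checks being discussed above, runnable from any outside host (the hostname is just the one under discussion; exact flags and verbose output vary by curl and dig build):

    # does DNS answer quickly and return the expected records?
    dig +short A releases.openstack.org
    dig +short AAAA releases.openstack.org

    # time the TLS handshake and response; hanging after "Client hello" matches the symptom above
    curl -v --max-time 15 -o /dev/null https://releases.openstack.org/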
fungi | no afs problems mentioned in dmesg | 16:00 |
corvus | i wonder if we have stuck apache workers or something. | 16:00 |
fungi | but yeah, i'm seeing no packet loss at all over ipv4 or v6 | 16:00 |
fungi | checking apache logs | 16:01 |
Matko[m] | network connectivity is fine, no packet loss according to mtr. | 16:01 |
Matko[m] | it feels more like the http server is too busy or doesn't have workers available to respond. | 16:02 |
corvus | yeah... curl to localhost on the server is failing similarly | 16:03 |
fungi | there's a "AH00484: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting" but that was ~10 hours ago | 16:03 |
corvus | i suspect we're going to want to restart apache, but before we do, i'm trying to get a copy of the server-status | 16:03 |
fungi | right, i was refraining from restarting until we're satisfied we've collected everything we can | 16:04 |
fungi | what's odd is that i loaded that exact site right before i went to lunch, so whatever transpired, it happened or finally tipped over in the past hour-ish | 16:06 |
fungi | also, trying random other vhosts served from there, i'm getting the same behavior out of all of them, so at least it's server-wide and not just the releases.openstack.org site | 16:07 |
fungi | the system load is *so* low it's consistent with apache not actually serving anything at all | 16:08 |
fungi | (even though apache processes are still running and are the most active processes on the system) | 16:08 |
corvus | it's non-zero, just very close to zero | 16:09 |
corvus | (traffic, that is) | 16:09 |
Matko[m] | any potential syn flood attack, slowloris (if that's still a thing) or something similar happening? or does it seem only isolated to the apache process itself? | 16:10 |
fungi | the sites we serve aren't a likely target for those sorts of attacks, though yes this is the sort of behavior i'd expect from one (granted usually interrupt processing would be a lot higher, it's at basically zero too) | 16:11 |
fungi | in this case 100% of the (virtually nonexistent) load is "user" | 16:12 |
fungi | so it's not like the kernel or tcp/ip stack is busy dealing with anything | 16:12 |
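One generic way to sanity-check the syn-flood/slowloris theory raised above is to look at socket states on the server; these commands are illustrative and are not recorded in this log:

    # half-open connections (a large count would point at a SYN flood)
    ss -tan state syn-recv | wc -l

    # established connections to the web ports (slowloris tends to show up as many idle, long-lived ones)
    ss -tan state established '( sport = :80 or sport = :443 )' | wc -l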
corvus | (one of my two attempts at server-status has failed; still waiting on the second one) | 16:13 |
fungi | so more like apache is non-responsive due to a deadlock/livelock somewhere | 16:13 |
corvus | second one failed | 16:13 |
corvus | we should probably give up on that and just restart now | 16:14 |
fungi | and it's not limited to ssl handshakes; i'm seeing timeouts just getting http responses over 80/tcp | 16:14 |
frickler | +1 to restart | 16:15 |
fungi | i can restart apache unless someone else is already in progress | 16:16 |
corvus | fungi: go for it | 16:16 |
fungi | doing a full stop and start | 16:16 |
fungi | after stopping there was no apache process running, starting again now | 16:17 |
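On a systemd host the full stop/start described here would look roughly like the following, assuming the Debian/Ubuntu apache2 unit name (the exact commands used are not captured in the log):

    sudo systemctl stop apache2
    pgrep -a apache2                     # confirm no workers are left behind
    sudo systemctl start apache2
    systemctl status apache2 --no-pager  # verify it came back healthy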
fungi | content is immediately loading for me | 16:17 |
Matko[m] | looks good to me now | 16:17 |
fungi | Matko[m]: all looking correct now? | 16:17 |
fungi | you beat me to it | 16:18 |
Matko[m] | yes, all good. thanks for looking into it! | 16:18 |
fungi | #status log Restarted apache services on static.opendev.org in order to clear a mysterious hung connection issue that cropped up in the past hour, investigation underway | 16:18 |
fungi | thanks for the heads up Matko[m]! | 16:18 |
opendevstatus | fungi: finished logging | 16:19 |
fungi | my money's on an earlier overload from llm training crawlers triggering some corner-case bug that hung all the apache workers | 16:21 |
corvus | infra-root: if that happens again, try curl -o server-status.html https://static.opendev.org/server-status from the host itself so we can get a snapshot of the apache processes | 16:21 |
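For that curl to succeed, mod_status needs to be enabled with /server-status reachable from the host itself; a minimal sketch of such a configuration, not necessarily what static.opendev.org actually uses:

    # Debian/Ubuntu: a2enmod status, then at server config level (e.g. a conf-available snippet):
    ExtendedStatus On
    <Location "/server-status">
        SetHandler server-status
        Require local
    </Location>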
* fungi is just going to blame ai for everything any more | 16:21 | |
fungi | load average on the server has climbed, though not alarmingly, more just approaching what i'd expect to see under healthy usage | 16:34 |