Thursday, 2025-07-17

06:15 *** elodilles is now known as elodilles_pto
07:21 *** cloudnull1097746 is now known as cloudnull109774
10:11 <noonedeadpunk> fwiw gitea13 seems to be quite slow today...
10:42 <opendevreview> Merged openstack/project-config master: [release-tools] Improve tagging logs  https://review.opendev.org/c/openstack/project-config/+/954340
10:47 <opendevreview> Merged openstack/project-config master: Remove nodepool config jobs  https://review.opendev.org/c/openstack/project-config/+/955235
12:07 <stephenfin> fungi: clarkb: all comments on the pbr series are now addressed, afaict
12:33 <ykarel> corvus, frickler still seeing that multi provider multinode job issue in periodic, is there any fix still not merged?
12:33 <ykarel> it improved, but still seen, today's example https://zuul.openstack.org/build/fd909b7265c044a786e39c66e57984f7
12:34 <ykarel> mixed nodes from rax-ord-main and ovh-gra1-main
13:20 <fungi> apparently my shell server fell offline/rebooted at utc midnight. i'll check the bot-recorded logs for important channels, but ping me if there's something that needs my urgent attention since i may have missed it
13:39 <frickler> fungi: ykarel reported another "distributed multinode" issue just earlier, maybe not urgent, but still worth looking into? https://zuul.openstack.org/build/fd909b7265c044a786e39c66e57984f7
13:45 <fungi> thanks, looks like that was a periodic job too, so probably still us overwhelming one or more of our providers
13:47 <fungi> keep in mind that booting multi-node jobs across providers is still possible because it's a new feature of zuul; without that fallback attempt the job would have ended in a node_failure instead
13:48 <fungi> so short of making it possible to turn that feature off and going back to node_failure results for those cases, we can continue trying to fine-tune our use of various providers in order to avoid overloading them
14:03 <corvus> yep -- adding a feature to turn that into a node_failure is definitely on the table; i just think this is really helpful for finding issues and improving things, so i don't want to do it until we've fixed all the zuul bugs at least.  :)
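
The fallback behavior fungi and corvus describe can be pictured with a rough sketch like the one below. It is a hypothetical illustration only, not Zuul's actual launcher code; the QuotaExceeded exception and provider.boot() call are invented names for the example.

    class QuotaExceeded(Exception):
        """Hypothetical stand-in for a provider's 'over quota' error."""


    def request_nodes(providers, node_labels):
        """For each requested node, try providers in order and fall back
        on quota errors; only give up when no provider can supply it."""
        assignments = {}
        for label in node_labels:
            for provider in providers:
                try:
                    assignments[label] = provider.boot(label)
                    break  # this provider could supply the node
                except QuotaExceeded:
                    continue  # over quota here, try the next provider
            else:
                # every provider declined: this is the node_failure case
                raise RuntimeError(f"node_failure: no capacity for {label}")
        # nodes in one request may now come from different providers,
        # which is how a multinode job ends up split across clouds
        return assignments

In this picture a single overloaded provider no longer sinks the whole request, at the cost of occasionally mixing nodes from different clouds, which is exactly what ykarel observed.
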
14:03 <corvus> ykarel: thanks for the report, i'll look into it in a bit
14:18 <ykarel> thx frickler fungi corvus
14:25 <corvus> ykarel: frickler fungi it looks like there is indeed an opportunity for improvement here: https://review.opendev.org/c/zuul/zuul/+/955292 Handle another form of openstack quota error message
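
The gist of "handle another form of openstack quota error message" is recognizing the error text a cloud returns as a quota failure rather than a hard error. The sketch below is illustrative only and makes no claim about which strings the linked change actually adds; exact wording varies between clouds and API versions.

    import re

    # example patterns only; real clouds word their quota errors in
    # several ways, which is why new forms keep needing to be handled
    QUOTA_ERROR_PATTERNS = [
        re.compile(r"quota exceeded", re.IGNORECASE),
        re.compile(r"exceeds available quota", re.IGNORECASE),
    ]


    def looks_like_quota_error(message):
        """Return True if an API error message appears to be a quota
        failure, so it can be treated as 'retry elsewhere' rather than
        a hard provider error."""
        return any(p.search(message) for p in QUOTA_ERROR_PATTERNS)

    # e.g. looks_like_quota_error("Quota exceeded for instances: ...") -> True
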
14:48 <fungi> thanks!
14:53 <fungi> heading to lunch, back in an hour
15:47 <Matko[m]> Am I the only one having issues reaching https://releases.openstack.org/ ?
15:54 <corvus> it does seem slow to me
15:55 <corvus> the server is up and lightly loaded
15:56 <fungi> load average: 0.06, 0.06, 0.06
15:57 <fungi> doesn't seem that loaded from a system level
15:57 <fungi> yeah, agreed
15:57 <corvus> i'm not seeing packet loss from my workstation
15:58 <fungi> could dns resolution be slow?
15:58 <Matko[m]> I tried to reach it from 3 different sources and also asked a colleague to test. All tests show issues getting a response from the server. ping works fine, resolution is fine too. I get the IP but the connection to 443/tcp is slow.
15:58 <fungi> interesting, i had it up in my browser before i went to lunch, and when i reloaded the page just now i got "This site can’t be reached: releases.openstack.org took too long to respond."
15:59 <corvus> dns responds quickly here and looks good
15:59 <Matko[m]> curl is stuck at "(304) (OUT), TLS handshake, Client hello (1)"
16:00 <fungi> no afs problems mentioned in dmesg
16:00 <corvus> i wonder if we have stuck apache workers or something.
16:00 <fungi> but yeah, i'm seeing no packet loss at all over ipv4 or v6
16:01 <fungi> checking apache logs
16:01 <Matko[m]> network connectivity is fine, no packet loss according to mtr.
16:02 <Matko[m]> it feels more like the http server is too busy or doesn't have workers available to respond.
16:03 <corvus> yeah... curl to localhost on the server is failing similarly
16:03 <fungi> there's an "AH00484: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting" but that was ~10 hours ago
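
For reference, AH00484 is the Apache MPM warning emitted when the worker pool is exhausted. A quick way to see when that last happened is to pull those lines out of the error log; the sketch below assumes a Debian-style /var/log/apache2/error.log path, so adjust for the host's actual layout.

    import sys

    LOG_PATH = "/var/log/apache2/error.log"  # assumed default, adjust as needed


    def show_worker_exhaustion(path=LOG_PATH):
        """Print every AH00484 (MaxRequestWorkers reached) warning so the
        timestamps can be compared against when the outage started."""
        with open(path, errors="replace") as fh:
            for line in fh:
                if "AH00484" in line:
                    print(line.rstrip())


    if __name__ == "__main__":
        show_worker_exhaustion(sys.argv[1] if len(sys.argv) > 1 else LOG_PATH)
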
16:03 <corvus> i suspect we're going to want to restart apache, but before we do, i'm trying to get a copy of the server-status
16:04 <fungi> right, i was refraining from restarting until we're satisfied we've collected everything we can
16:06 <fungi> what's odd is that i loaded that exact site right before i went to lunch, so whatever's transpired either happened or finally tipped over in the past hour-ish
16:07 <fungi> also trying random other vhosts served from there i'm getting the same behavior out of all of them, so at least it's server-wide and not just the releases.openstack.org site
16:08 <fungi> the system load is *so* low it's consistent with apache not actually serving anything at all
16:08 <fungi> (even though apache processes are still running and are the most active processes on the system)
16:09 <corvus> it's non-zero, just very close to zero
16:09 <corvus> (traffic, that is)
16:10 <Matko[m]> any potential syn flood attack, slowloris (if that's still a thing) or something similar happening? or does it seem isolated to the apache process itself?
16:11 <fungi> the sites we serve aren't a likely target for those sorts of attacks, though yes this is the sort of behavior i'd expect from one (granted usually interrupt processing would be a lot higher, and it's at basically zero too)
16:12 <fungi> in this case 100% of the (virtually nonexistent) load is "user"
16:12 <fungi> so it's not like the kernel or tcp/ip stack is busy dealing with anything
16:13 <corvus> (one of my two attempts at server-status has failed; still waiting on the second one)
16:13 <fungi> so more like apache is non-responsive due to a deadlock/livelock somewhere
16:13 <corvus> second one failed
16:14 <corvus> we should probably give up on that and just restart now
16:14 <fungi> and it's not limited to ssl handshakes, i'm seeing timeouts just getting http responses over 80/tcp
16:15 <frickler> +1 to restart
16:16 <fungi> i can restart apache unless someone else is already in progress
16:16 <corvus> fungi: go for it
16:16 <fungi> doing a full stop and start
16:17 <fungi> after stopping there was no apache process running, starting again now
16:17 <fungi> content is immediately loading for me
16:17 <Matko[m]> looks good to me now
16:17 <fungi> Matko[m]: all looking correct now?
16:18 <fungi> you beat me to it
16:18 <Matko[m]> yes, all good. thanks for looking into it!
16:18 <fungi> #status log Restarted apache services on static.opendev.org in order to clear a mysterious hung connection issue that cropped up in the past hour, investigation underway
16:18 <fungi> thanks for the heads up Matko[m]!
16:19 <opendevstatus> fungi: finished logging
16:21 <fungi> my money's on an earlier overload from llm training crawlers triggering some corner case bug that hung all the apache workers
16:21 <corvus> infra-root: if that happens again, try curl -o server-status.html https://static.opendev.org/server-status from the host itself so we can get a snapshot of the apache processes
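
A small variation on that idea, sketched below, adds an explicit timeout so the snapshot attempt fails fast instead of hanging along with the server. It assumes mod_status is enabled and /server-status is reachable from the host itself; either outcome (a saved snapshot or a quick timeout) is useful data.

    import urllib.request

    URL = "https://static.opendev.org/server-status"
    OUT = "server-status.html"

    try:
        # a short timeout means we either get the snapshot quickly or
        # learn that even local requests are wedged
        with urllib.request.urlopen(URL, timeout=15) as resp:
            with open(OUT, "wb") as fh:
                fh.write(resp.read())
        print("saved snapshot to", OUT)
    except Exception as exc:
        print("could not fetch", URL, "-", exc)
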
16:21 * fungi is just going to blame ai for everything any more
16:34 <fungi> load average on the server has climbed, though not alarmingly, more just approaching what i'd expect to see under healthy usage
