*** elodilles is now known as elodilles_pto | 06:15 | |
*** cloudnull1097746 is now known as cloudnull109774 | 07:21 | |
noonedeadpunk | fwiw gitea13 seems to be quite slow today... | 10:11 |
opendevreview | Merged openstack/project-config master: [release-tools] Improve tagging logs https://review.opendev.org/c/openstack/project-config/+/954340 | 10:42 |
opendevreview | Merged openstack/project-config master: Remove nodepool config jobs https://review.opendev.org/c/openstack/project-config/+/955235 | 10:47 |
stephenfin | fungi: clarkb: all comments on the pbr series are now addressed, afaict | 12:07 |
ykarel | corvus, frickler: still seeing that multi-provider multinode job issue in periodic jobs; is there a fix that hasn't merged yet? | 12:33 |
ykarel | it improved, but it's still seen; today's example: https://zuul.openstack.org/build/fd909b7265c044a786e39c66e57984f7 | 12:33 |
ykarel | mixed nodes from rax-ord-main and ovh-gra1-main | 12:34 |
fungi | apparently my shell server fell offline/rebooted at utc midnight. i'll check the bot-recorded logs for important channels, but ping me if there's something that needs my urgent attention since i may have missed it | 13:20 |
frickler | fungi: ykarel reported another "distributed multinode" issue just earlier, maybe not urgent, but still worth looking into? https://zuul.openstack.org/build/fd909b7265c044a786e39c66e57984f7 | 13:39 |
fungi | thanks, looks like that was a periodic job too, so probably still us overwhelming one or more of our providers | 13:45 |
fungi | keep in mind that booting multi-node jobs across providers is possible at all because it's a new feature of zuul; without that fallback attempt the job would have ended in a node_failure instead | 13:47 |
fungi | so short of making it possible to turn that feature off and going back to node_failure results for those cases, we can continue trying to fine-tune our use of various providers in order to avoid overloading them | 13:48 |
corvus | yep -- adding a feature to turn that into a node_failure is definitely on the table; i just think this is really helpful for finding issues and improving things, so i don't want to do it until we've fixed all the zuul bugs at least. :) | 14:03 |
corvus | ykarel: thanks for the report, i'll look into it in a bit | 14:03 |
ykarel | thx frickler fungi corvus | 14:18 |
corvus | ykarel: frickler fungi it looks like there is indeed an opportunity for improvement here: https://review.opendev.org/c/zuul/zuul/+/955292 Handle another form of openstack quota error message | 14:25 |
fungi | thanks! | 14:48 |
fungi | heading to lunch, back in an hour | 14:53 |
Matko[m] | Am I the only one having issues reaching https://releases.openstack.org/ ? | 15:47 |
corvus | it does seem slow to me | 15:54 |
corvus | the server is up and lightly loaded | 15:55 |
fungi | load average: 0.06, 0.06, 0.06 | 15:56 |
fungi | doesn't seem that loaded from a system level | 15:57 |
fungi | yeah, agreed | 15:57 |
corvus | i'm not seeing packet loss from my workstation | 15:57 |
fungi | could dns resolution be slow? | 15:58 |
Matko[m] | I tried to reach it from 3 different sources and also asked a colleague to test. All tests show issues getting a response from the server. ping works fine, resolution is fine too. I get the IP but the connection to 443/tcp is slow. | 15:58 |
fungi | interesting, i had it up in my browser before i went to lunch, and when i reloaded the page just now i got "This site can’t be reached: releases.openstack.org took too long to respond." | 15:58 |
corvus | dns responds quickly here and looks good | 15:59 |
Matko[m] | curl is stuck at "(304) (OUT), TLS handshake, Client hello (1)" | 15:59 |
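For reference, a rough sketch of the client-side checks being discussed above, runnable from any outside host (the hostname is just the one under discussion; exact flags and verbose output vary by curl and dig build):

    # does DNS answer quickly and return the expected records?
    dig +short A releases.openstack.org
    dig +short AAAA releases.openstack.org

    # time the TLS handshake and response; hanging after "Client hello" matches the symptom above
    curl -v --max-time 15 -o /dev/null https://releases.openstack.org/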
fungi | no afs problems mentioned in dmesg | 16:00 |
corvus | i wonder if we have stuck apache workers or something. | 16:00 |
fungi | but yeah, i'm seeing no packet loss at all over ipv4 or v6 | 16:00 |
fungi | checking apache logs | 16:01 |
Matko[m] | network connectivity is fine, no packet loss according to mtr. | 16:01 |
Matko[m] | it feels more like the http server is too busy or doesn't have workers available to respond. | 16:02 |
corvus | yeah... curl to localhost on the server is failing similarly | 16:03 |
fungi | there's a "AH00484: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting" but that was ~10 hours ago | 16:03 |
corvus | i suspect we're going to want to restart apache, but before we do, i'm trying to get a copy of the server-status | 16:03 |
fungi | right, i was refraining from restarting until we're satisfied we've collected everything we can | 16:04 |
fungi | what's odd is that i loaded that exact site right before i went to lunch, so whatever transpired, it happened or finally tipped over in the past hour-ish | 16:06 |
fungi | also, trying random other vhosts served from there, i'm getting the same behavior out of all of them, so at least it's server-wide and not just the releases.openstack.org site | 16:07 |
fungi | the system load is *so* low it's consistent with apache not actually serving anything at all | 16:08 |
fungi | (even though apache processes are still running and are the most active processes on the system) | 16:08 |
corvus | it's non-zero, just very close to zero | 16:09 |
corvus | (traffic, that is) | 16:09 |
Matko[m] | any potential syn flood attack, slowloris (if that's still a thing) or something similar happening? or does it seem only isolated to the apache process itself? | 16:10 |
fungi | the sites we serve aren't a likely target for those sorts of attacks, though yes this is the sort of behavior i'd expect from one (granted usually interrupt processing would be a lot higher, it's at basically zero too) | 16:11 |
fungi | in this case 100% of the (virtually nonexistent) load is "user" | 16:12 |
fungi | so it's not like the kernel or tcp/ip stack is busy dealing with anything | 16:12 |
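One generic way to sanity-check the syn-flood/slowloris theory raised above is to look at socket states on the server; these commands are illustrative and are not recorded in this log:

    # half-open connections (a large count would point at a SYN flood)
    ss -tan state syn-recv | wc -l

    # established connections to the web ports (slowloris tends to show up as many idle, long-lived ones)
    ss -tan state established '( sport = :80 or sport = :443 )' | wc -l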
corvus | (one of my two attempts at server-status has failed; still waiting on the second one) | 16:13 |
fungi | so more like apache is non-responsive due to a deadlock/livelock somewhere | 16:13 |
corvus | second one failed | 16:13 |
corvus | we should probably give up on that and just restart now | 16:14 |
fungi | and it's not limited to ssl handshakes; i'm seeing timeouts just getting http responses over 80/tcp | 16:14 |
frickler | +1 to restart | 16:15 |
fungi | i can restart apache unless someone else is already in progress | 16:16 |
corvus | fungi: go for it | 16:16 |
fungi | doing a full stop and start | 16:16 |
fungi | after stopping there was no apache process running, starting again now | 16:17 |
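On a systemd host the full stop/start described here would look roughly like the following, assuming the Debian/Ubuntu apache2 unit name (the exact commands used are not captured in the log):

    sudo systemctl stop apache2
    pgrep -a apache2                     # confirm no workers are left behind
    sudo systemctl start apache2
    systemctl status apache2 --no-pager  # verify it came back healthy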
fungi | content is immediately loading for me | 16:17 |
Matko[m] | looks good to me now | 16:17 |
fungi | Matko[m]: all looking correct now? | 16:17 |
fungi | you beat me to it | 16:18 |
Matko[m] | yes, all good. thanks for looking into it! | 16:18 |
fungi | #status log Restarted apache services on static.opendev.org in order to clear a mysterious hung connection issue that cropped up in the past hour, investigation underway | 16:18 |
fungi | thanks for the heads up Matko[m]! | 16:18 |
opendevstatus | fungi: finished logging | 16:19 |
fungi | my money's on an earlier overload from llm training crawlers triggering some corner-case bug that hung all the apache workers | 16:21 |
corvus | infra-root: if that happens again, try curl -o server-status.html https://static.opendev.org/server-status from the host itself so we can get a snapshot of the apache processes | 16:21 |
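For that curl to succeed, mod_status needs to be enabled with /server-status reachable from the host itself; a minimal sketch of such a configuration, not necessarily what static.opendev.org actually uses:

    # Debian/Ubuntu: a2enmod status, then at server config level (e.g. a conf-available snippet):
    ExtendedStatus On
    <Location "/server-status">
        SetHandler server-status
        Require local
    </Location>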
* fungi is just going to blame ai for everything any more | 16:21 | |
fungi | load average on the server has climbed, though not alarmingly, more just approaching what i'd expect to see under healthy usage | 16:34 |