Thursday, 2025-01-09

clarkbI managed to get the vast majority of the paperwork I need to do done today so I'm hopeful tomorrow I'll start looking at Noble paste server00:09
fricklerinfra-root: checking the log on bridge for https://zuul.opendev.org/t/openstack/build/764061aeb33b4d5b8eb9cc17b7274252 it looks like access to opendevci-raxflex is broken while opendevzuul-raxflex (the nodepool tenant) is still fine. is there anything known about that?06:21
fricklerI can confirm getting a 401 for the former cloud on bridge when testing myself06:22
fricklerhmm, actually this seems to have started failing on 2024-11-22. oldest log we have is 2024-12-10T02:38:15 and that already shows the failure07:46
frickleranyway, seems my cleanup change hasn't been deployed yet because infra-prod-base ran before the repo was updated on bridge, so I'll have to wait another day to confirm it works as expected. today's u-a mails were cluttered anyway due to bind9 updates07:57
fricklerinfra-root: looks like there is an issue with the github token used by zuul that I recently updated, see the discussion in #openstack-infra09:51
fricklerI've updated the token now and in addition manually updated the zuul.conf on the schedulers, but I'm unsure if a) this is urgent enough to warrant a manual zuul restart and b) whether https://docs.opendev.org/opendev/system-config/latest/zuul.html#restarting-zuul-services is still up-to-date09:53
fricklerchandankumar: fyi ^^ I'll wait for feedback from other admins before proceeding09:54
chandankumarsure, thank you :-)09:54
opendevreviewDmitriy Rabotyagov proposed openstack/project-config master: End gating for qdrouterd role  https://review.opendev.org/c/openstack/project-config/+/93818811:11
opendevreviewDmitriy Rabotyagov proposed openstack/project-config master: Remove qdrouterd role from the config  https://review.opendev.org/c/openstack/project-config/+/93819411:11
opendevreviewMerged openstack/project-config master: End gating for qdrouterd role  https://review.opendev.org/c/openstack/project-config/+/93818813:44
opendevreviewMerged openstack/project-config master: End gating for Freezer DR project  https://review.opendev.org/c/openstack/project-config/+/93818013:44
opendevreviewMerged openstack/project-config master: Remove Murano/Senlin/Sahara OSA roles from infra  https://review.opendev.org/c/openstack/project-config/+/93568313:49
opendevreviewMerged opendev/system-config master: Mirror jaegertracing  https://review.opendev.org/c/opendev/system-config/+/93869015:45
clarkbopendevci-raxflex not working but opendevzuul-raxflex working is not known or expected. The credentials are the same as our regular rax stuff iirc but we point at a different keystone or something like that15:59
clarkbfrickler: also you are correct, that document for restarting zuul still assumes the single scheduler setup and isn't accurate15:59
clarkbfrickler: at this point I think we can safely restart any single zuul service except for the executors; those need to be gracefully shut down to avoid noticeable impact16:00
clarkbfrickler: https://opendev.org/opendev/system-config/src/branch/master/playbooks/zuul_reboot.yaml this is what the weekly upgrades run16:00
clarkblooks like it basically does a scheduler stop, then reboot, then start. So I think you can do a stop then start (no reboot) on each scheduler one at a time. May need to do that for the web processes too. I'm not sure if they use the github token16:01
clarkbI can confirm that the openstack client reproduces the raxflex auth issue. However authenticating to old rax does not fail. Visual inspection of clouds.yaml content doesn't show any obvious differences in the bits that are raxflex specific between opendevci-raxflex and opendevzuul-raxflex nor do I see any obvious differences in the shared bits between opendevci-raxflex and openstackci-rax16:16
fungii wonder if it's a domain caching problem. has anyone tried logging into skyline with that account yet?16:16
clarkbthe error occurs when doing POST /v3/auth/tokens; you get back a 40116:17
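A rough sketch of reproducing that from bridge with the openstack client (the cloud entry names come from the discussion above; using "token issue" as the probe is an assumption, any authenticated call should behave the same):

    # fails with an HTTP 401 from POST /v3/auth/tokens
    openstack --os-cloud opendevci-raxflex token issue
    # the nodepool tenant against the same cloud still authenticates fine
    openstack --os-cloud opendevzuul-raxflex token issue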
clarkbfungi: I have not. The extent of my debugging is above16:17
fungihalf suspecting that if we log into the webui once, everything else will start working again16:18
fungibut i'm hesitant to do that if we want to give one of the flex folks an opportunity to take a look first16:18
clarkboh right seems like you hit similar when we first set that up16:18
fungiright, that's why i'm coming back to it16:19
clarkbcardoe: ^ fyi. Doesn't look like klamath is in the channel right now16:26
fricklerok, so I'll restart (stop+start) the schedulers on zuul01+02 now. if the web services need a restart, that's fine to wait another two days I'd think16:26
cardoeLooking.16:26
clarkbfrickler: the important thing is you do them one at a time and wait for them to fully come up before doing the next one16:26
frickleryes, sure16:26
clarkbfrickler: I think it may take up to like 20 minutes for a scheduler to fully come up16:26
clarkbthe https://zuul.opendev.org/components list is a good way to verify it is up after the logs look good16:28
clarkboh ya that is what the playbook does before proceeding too16:30
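A sketch of the same check from the command line, assuming the /api/components endpoint in zuul-web backs that page (the jq filter is an assumption):

    curl -s https://zuul.opendev.org/api/components | \
      jq '.scheduler[] | {hostname, state}'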
fricklerok, if I read those playbooks correctly, it does "cd /etc/zuul-scheduler/; docker-compose down" and then "... up -d", right?16:32
clarkbfrickler: yes that is my read of the ansible too16:32
fricklerwith some sudo added. ok, will do that on zuul01 now16:33
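A minimal sketch of that restart sequence as described above (the sudo usage and the wait step are assumptions about how it is run by hand):

    cd /etc/zuul-scheduler/
    sudo docker-compose down    # stop the scheduler container
    sudo docker-compose up -d   # start it again in the background
    # wait for the scheduler to finish initializing (watch the logs and
    # https://zuul.opendev.org/components) before doing the other one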
clarkbthe components list shows 01 is initializing so that's good. Now we wait16:35
clarkbfrickler: note that did end up picking up one new commit in zuul. But that commit is a CSS-only update so shouldn't affect anything outside of browsers until we restart webs. And even then the impact is minimal16:39
clarkbcomponents list shows the scheduler is running now. That was faster than I expected.16:41
fricklerok, logs also look normal to me, doing zuul02 now16:48
fricklerjobs are running now for https://zuul.opendev.org/t/openstack/status?change=938776 after a recheck, so the new github token is working better, too16:53
frickler#status log restarted zuul-scheduler on zuul01+02 in order to pick up a new github api token16:54
opendevstatusfrickler: finished logging16:54
clarkbthank you for taking care of that17:01
fricklerlogs on zuul02 also look fine, so I consider this topic done for now17:01
fungithanks!17:03
opendevreviewClark Boylan proposed opendev/system-config master: Update to Gitea 1.23  https://review.opendev.org/c/opendev/system-config/+/93882617:05
clarkbif ^ passes I'll work on holding a node so that we can check it more interactively17:05
cardoeclarkb: did you create a newer account and are trying to use it in that project?17:48
clarkbcardoe: no17:49
cardoeThey did some swizzling such that newer accounts do not get some default permission thing to see projects.17:49
clarkbit's the same credentials that we use against rax classic. fungi ran into similar when we first set this up and apparently logging into skyline at the time got things connected between the two systems.17:49
fungiyeah, sounds like whatever broke this was back in november, we only just noticed we can't authenticate to it17:51
cardoeWhat are the actual usernames and/or tenant ID?18:21
fricklercardoe: broken one is username: 'openstackci', project_id: 'ed36be9ee2a84911982e95636a85698c'18:42
fricklerfor comparison, the working one is openstackjenkins / f063ac0bb70c486db47bcf2105eebcbd18:42
funginote that we ended up using the local cloud project ids because of the sync problem i hit during early setup where the nnnnn/Flex federated ids were rejected by the api until i first logged into skyline18:54
opendevreviewClark Boylan proposed opendev/system-config master: DNM intentional Gitea failure to hold a node  https://review.opendev.org/c/opendev/system-config/+/84818118:56
fungii think something may be amiss with nl01, i'm getting random connection resets trying to ssh into it, but sometimes i get through18:57
clarkbtkajinam: do you still need your heat-functional held node?18:57
clarkbfungi: nl01 pings from here lgtm over ipv4. Could it be ipv6 specific?18:58
fungimaybe18:58
funginope18:58
fricklernl01 works fine for me over ipv6 (by default, didn't explicitly choose it)18:58
fungifungi@dhole:~$ ossh -4 nl0118:58
fungiConnection closed by 104.239.144.109 port 2218:58
fungiseems i get through about 50% of the time regardless of whether i'm going via v4 or v618:59
fungiand nl02 works every time18:59
fricklerI do see the same when forcing v419:00
fungi0% icmp packet loss over both v4 and v619:00
fungiguessing it's something to do with state handling on its hypervisor host19:00
fungiprobably a noisy neighbor filling up the conntrack tables or something19:01
clarkbI just did ~10 ssh connections with no problem19:01
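A quick way to sample the failure rate, roughly what was being done by hand here (the hostname and attempt count are assumptions):

    for i in $(seq 1 10); do
      ssh -o ConnectTimeout=5 nl01.opendev.org true \
        && echo "attempt $i: ok" || echo "attempt $i: failed"
    done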
fungiweird19:01
fungithen i guess it could be further into the network layer19:01
fungimaybe their core flow handling has my sessions balanced 50% onto a bad switch and not yours or something19:02
clarkbya its interesting that it reports the remote is closing the connection19:03
clarkbbut I guess that could happen due to being unhappy about asymetrically routed packets19:03
fungiit could be just about anything. lots of layer 3/4 middleboxen will spoof tcp/rst packets on behalf of the peer19:04
fungiincluding poorly-behaved rate limiters and packet shapers19:04
fricklersince clarkb logged it, I'm not getting failures anymore either. problem fixed ;)19:05
fungihah19:05
fricklers/it/in/19:05
fungiand yeah, it's cleared up for me now too, whatever it was19:05
fungimaybe the problem got routed around19:06
fungior a flood somewhere died off19:06
fungisplitting here as a tangent from a discussion in the zuul matrix channel, is anyone else surprised that a held jammy node shows an installed version of podman from the original jammy release rather than a newer version available in jammy-security and jammy-updates?19:12
clarkbdoes the job specify the version maybe?19:13
fungioh maybe... dpkg.log shows it being installed during the job at least, so it's not preinstalled19:15
fungiyep19:17
fungipodman=3.4.4+ds1-1ubuntu119:17
fungihttps://review.opendev.org/c/zuul/zuul-jobs/+/886552 Pin podman on jammy19:20
fungioh, right, that was because of https://bugs.launchpad.net/ubuntu/+source/libpod/+bug/202439419:20
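For reference, pinning to that exact version with apt looks roughly like this (whether the zuul-jobs change does it this way or via role variables is not confirmed here):

    # install the original jammy release version instead of jammy-updates
    sudo apt-get install -y podman=3.4.4+ds1-1ubuntu1
    # optionally hold it so later upgrades don't move it
    sudo apt-mark hold podman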
cardoefungi/clarkb: auth should be fixed19:50
fungithanks cardoe!!!19:50
fungifrickler: ^19:51
clarkbcardoe: fungi: confirmed osc server list is working for me now19:51
cardoeSo I misspoke with the "new accounts do not see projects" explanation19:51
clarkbcardoe: would logging in through skyline have made a difference? Mostly curious if that thought was relevant19:55
cardoeEssentially there's some backend-side "this account is authorized for this product type" thing. When they started Flex it was just using the existing rax openstack value. They made them have a new value that says flex, so you had to have that to auth against flex. They tried to use some heuristics as to who used flex and should get it. And they missed.19:57
cardoeSo skyline usage wouldn't have fixed it.19:57
clarkbgotcha19:59
fricklerjust strange that it seems to have been working for some time, then started failing like 6 weeks ago20:02
fricklerah that's likely the point where that heuristic got applied?20:04
fungiinterestingly my personal rackspace account which has server instances in both rax classic and flex seems to be working fine20:05
fungiso i guess we just got lucky with the control plane account20:05
fungi(un)lucky20:05
frickleranyway, good that it is working now and I'll recheck the results of the periodic jobs tomorrow20:06
clarkbhttps://200.225.47.39:3081/opendev/system-config is the held gitea 1.23.020:48
clarkbnot sure if we want to try and get that in quickly due to it theoretically fixing the memory leak or wait for a .1. Typically they have a .1 not too long after the .0 release20:48
clarkbtonyb: I've gotten to the point of getting ready to boot a Noble paste02 server and I don't see the noble image in rax dfw (or ord and iad). I thought you had converted and uploaded noble images everywhere. Am I misremembering?20:54
clarkbLooks like on july 25th we discussed this and https://paste.opendev.org/raw/b8QSrgmqdHkbHY0S6801/ was going to be uploaded21:01
clarkbbut I'm not seeing those images when running image list or image list --shared21:02
clarkbmakes me wonder if the upload failed or maybe we didn't think we had to do the dance to upload via glance tasks but we do? or maybe the images were removed?21:03
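The checks mentioned above look roughly like this (the cloud entry name and region flag are assumptions about how clouds.yaml on bridge is laid out):

    # look for a noble image among our own and shared images in rax dfw
    openstack --os-cloud openstackci-rax --os-region-name DFW image list | grep -i noble
    openstack --os-cloud openstackci-rax --os-region-name DFW image list --shared | grep -i noble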
jaltman>***fungi sighs at https://bugs.debian.org/1089501 having been open for weeks now with no confirmation (also i hunted for upstream bug reports and changes in their gerrit, but no dice)21:04
jaltmanhttps://www.openafs.org/dl/openafs/1.8.13.1/RELNOTES-1.8.13.1 released on 19-Dec-202421:04
jaltman Linux clients      * Add support for Linux 6.12 (15964, 15965, 15966) 21:04
clarkbtonyb: I do see the image in other clouds that I've spot checked. Hoping you can remember what happened there.21:05
fungijaltman: thanks! i hadn't spotted the .1 release21:06
fungii'll prod the debian package maintainer21:06
clarkbI see the vhd image file in tonyb's homedir on bridge so it did get created at least21:06
clarkbs/created/converted from raw/21:06
clarkbI'm suddenly worried we're going to go through quite the experience just to upload an image to rax. But we know it is doable because nodepool does it21:11
clarkbcloudnull not sure if this is on your radar yet but rax classic doesn't have Ubuntu Noble images yet (I thought we had uploaded our own, but it doesn't look like we did, hence my discovery that the cloud doesn't provide any yet either)21:15
cardoefrickler: yeah 6 weeks sounds about right21:18
opendevreviewMerged zuul/zuul-jobs master: Add ability to exclude a specific platform  https://review.opendev.org/c/zuul/zuul-jobs/+/92616422:55
opendevreviewMerged zuul/zuul-jobs master: ensure-podman: add tasks to configure socket group  https://review.opendev.org/c/zuul/zuul-jobs/+/92591622:55
clarkbI'm not sure if tonyb is around today or not but I think he is on vacation starting tomorrow? I guess I'll give it until then for feedback on the image, otherwise I'll try to convert a current image to vhd and then proceed to attempting to upload it to the three rax classic regions22:56
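A sketch of the conversion step being described (file names are hypothetical; rax classic is picky about the VHD variant, so the vhd-util path that diskimage-builder uses may be needed instead of plain qemu-img):

    # convert the raw image to a dynamic VHD for upload to rax classic
    qemu-img convert -f raw -O vpc -o subformat=dynamic \
      ubuntu-noble.raw ubuntu-noble.vhd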
clarkbI've started brainstorming a rough outline of topics for a january opendev meetup in https://etherpad.opendev.org/p/opendev-january-2025-meetup23:41
clarkbright now there isn't really an order; I'm just braindumping topics. Feel free to add your own and then as we get closer to a schedule/plan I'll try to organize things23:42
