clarkb | I managed to get the vast majority of the paperwork I needed to do done today, so I'm hopeful I'll start looking at the Noble paste server tomorrow | 00:09 |
---|---|---|
frickler | infra-root: checking the log on bridge for https://zuul.opendev.org/t/openstack/build/764061aeb33b4d5b8eb9cc17b7274252 it looks like access to opendevci-raxflex is broken while opendevzuul-raxflex (the nodepool tenant) is still fine. is there anything known about that? | 06:21 |
frickler | I can confirm getting a 401 for the former cloud on bridge when testing myself | 06:22 |
frickler | hmm, actually this seems to have started failing on 2024-11-22. oldest log we have is 2024-12-10T02:38:15 and that already shows the failure | 07:46 |
frickler | anyway, seems my cleanup change hasn't been deployed yet because infra-prod-base ran before the repo was updated on bridge, so I'll have to wait another day to confirm it works as expected. today's u-a mails were cluttered anyway due to bind9 updates | 07:57 |
frickler | infra-root: looks like there is an issue with the github token used by zuul that I recently updated, see the discussion in #openstack-infra | 09:51 |
frickler | I've updated the token now and in addition manually updated the zuul.conf on the schedulers, but I'm unsure a) whether this is urgent enough to warrant a manual zuul restart and b) whether https://docs.opendev.org/opendev/system-config/latest/zuul.html#restarting-zuul-services is still up-to-date | 09:53 |
frickler | chandankumar: fyi ^^ I'll wait for feedback from other admins before proceeding | 09:54 |
chandankumar | sure, thank you :-) | 09:54 |
opendevreview | Dmitriy Rabotyagov proposed openstack/project-config master: End gating for qdrouterd role https://review.opendev.org/c/openstack/project-config/+/938188 | 11:11 |
opendevreview | Dmitriy Rabotyagov proposed openstack/project-config master: Remove qdrouterd role from the config https://review.opendev.org/c/openstack/project-config/+/938194 | 11:11 |
opendevreview | Merged openstack/project-config master: End gating for qdrouterd role https://review.opendev.org/c/openstack/project-config/+/938188 | 13:44 |
opendevreview | Merged openstack/project-config master: End gating for Freezer DR project https://review.opendev.org/c/openstack/project-config/+/938180 | 13:44 |
opendevreview | Merged openstack/project-config master: Remove Murano/Senlin/Sahara OSA roles from infra https://review.opendev.org/c/openstack/project-config/+/935683 | 13:49 |
opendevreview | Merged opendev/system-config master: Mirror jaegertracing https://review.opendev.org/c/opendev/system-config/+/938690 | 15:45 |
clarkb | opendevci-raxflex not working but opendevzuul-raxflex working is not known or expected. The credentials are the same as our regular rax stuff iirc but we point at a different keystone or something like that | 15:59 |
clarkb | frickler: also you are correct, that document for restarting zuul still looks to assume the single-scheduler setup, so it isn't accurate | 15:59 |
clarkb | frickler: at this point I think we can safely restart any single zuul service except for the executors; those need to be gracefully shut down to avoid noticeable impact | 16:00 |
clarkb | frickler: https://opendev.org/opendev/system-config/src/branch/master/playbooks/zuul_reboot.yaml this is what the weekly upgrades run | 16:00 |
clarkb | looks like it basically does a scheduler stop, then reboot, then start. So I think you can do a stop then start (no reboot) on each scheduler one at a time. May need to do that for the web processes too. I'm not sure if they use the github token | 16:01 |
clarkb | I can confirm that openstack client reproduces the raxflex auth issue. However authenticating to old rax does not fail. Visual inspection of clouds.yaml content doesn't show any obvious differences in the bits that are raxflex specific between opendevci-raxflex and opendevzuul-raxflex, nor do I see any obvious differences in the shared bits between opendevci-raxflex and openstackci-rax | 16:16 |
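
(A minimal sketch of the comparison described above, assuming the clouds.yaml entries on bridge are named as in the discussion; the exact commands clarkb ran aren't shown in the log.)

```shell
# Request a token from each cloud; the broken tenant gets an HTTP 401
# back from POST /v3/auth/tokens while the other two succeed.
openstack --os-cloud opendevci-raxflex token issue     # fails with 401
openstack --os-cloud opendevzuul-raxflex token issue   # nodepool tenant, works
openstack --os-cloud openstackci-rax token issue       # old rax, shared creds, works
```
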
fungi | i wonder if it's a domain caching problem. has anyone tried logging into skyline with that account yet? | 16:16 |
clarkb | the error occurs when doing POST /v3/auth/tokens; you get back a 401 | 16:17 |
clarkb | fungi: I have not. The extent of my debugging is above | 16:17 |
fungi | half suspecting that if we log into the webui once, everything else will start working again | 16:18 |
fungi | but i'm hesitant to do that if we want to give one of the flex folks an opportunity to take a look first | 16:18 |
clarkb | oh right seems like you hit similar when we first set that up | 16:18 |
fungi | right, that's why i'm coming back to it | 16:19 |
clarkb | cardoe: ^ fyi. Doesn't look like klamath is in the channel right now | 16:26 |
frickler | ok, so I'll restart (stop+start) the schedulers on zuul01+02 now. if the web services need a restart, that's fine to wait another two days I'd think | 16:26 |
cardoe | Looking. | 16:26 |
clarkb | frickler: the important thing is you do them one at a time and wait for them to fully come up before doing the next one | 16:26 |
frickler | yes, sure | 16:26 |
clarkb | frickler: I think it may take up to like 20 minutes for a scheduler to fully come up | 16:26 |
clarkb | the https://zuul.opendev.org/components list is a good way to verify it is up after the logs look good | 16:28 |
clarkb | oh ya that is what the playbook does before proceeding too | 16:30 |
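
(For reference, the same information as the components page is available from Zuul's web API; a sketch assuming the usual /api/components endpoint.)

```shell
# Poll the component registry; a freshly restarted scheduler shows up as
# "initializing" and should report "running" before the next one is restarted.
curl -s https://zuul.opendev.org/api/components | python3 -m json.tool
```
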
frickler | ok, if I read those playbooks correctly, it does "cd /etc/zuul-scheduler/; docker-compose down" and then "... up -d", right? | 16:32 |
clarkb | frickler: yes that is my read of the ansible too | 16:32 |
frickler | with some sudo added. ok, will do that on zuul01 now | 16:33 |
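
(Roughly what the manual restart amounts to, based on frickler's reading of the playbook quoted above; a sketch, not the playbook itself.)

```shell
# On zuul01 first; only repeat on zuul02 once 01 is back to "running".
cd /etc/zuul-scheduler/
sudo docker-compose down
sudo docker-compose up -d
# Then tail the scheduler logs and watch https://zuul.opendev.org/components
```
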
clarkb | the components list shows 01 is initializing so that's good. Now we wait | 16:35 |
clarkb | frickler: note that did end up picking up one new commit in zuul. But that commit is a css-only update so it shouldn't affect anything outside of browsers until we restart the webs. And even then the impact is minimal | 16:39 |
clarkb | components list shows the scheduler is running now. That was faster than I expected. | 16:41 |
frickler | ok, logs also look normal to me, doing zuul02 now | 16:48 |
frickler | jobs are running now for https://zuul.opendev.org/t/openstack/status?change=938776 after a recheck, so the new github token is working better, too | 16:53 |
frickler | #status log restarted zuul-scheduler on zuul01+02 in order to pick up a new github api token | 16:54 |
opendevstatus | frickler: finished logging | 16:54 |
clarkb | thank you for taking care of that | 17:01 |
frickler | logs on zuul02 also look fine, so I consider this topic done for now | 17:01 |
fungi | thanks! | 17:03 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update to Gitea 1.23 https://review.opendev.org/c/opendev/system-config/+/938826 | 17:05 |
clarkb | if ^ passes I'll work on holding a node so that we can check it more interactively | 17:05 |
cardoe | clarkb: did you create a newer account and are trying to use it in that project? | 17:48 |
clarkb | cardoe: no | 17:49 |
cardoe | They did some swizzling such that newer accounts do not get some default permission needed to see projects. | 17:49 |
clarkb | it's the same credentials that we use against rax classic. fungi ran into something similar when we first set this up and apparently logging into skyline at the time got things connected between the two systems. | 17:49 |
fungi | yeah, sounds like whatever broke this was back in november, we only just noticed we can't authenticate to it | 17:51 |
cardoe | What are the actual usernames and/or tenant ID? | 18:21 |
frickler | cardoe: broken one is username: 'openstackci', project_id: 'ed36be9ee2a84911982e95636a85698c' | 18:42 |
frickler | for comparison, the working one is openstackjenkins / f063ac0bb70c486db47bcf2105eebcbd | 18:42 |
fungi | note that we ended up using the local cloud project ids because of the sync problem i hit during early setup where the nnnnn/Flex federated ids were rejected by the api until i first logged into skyline | 18:54 |
opendevreview | Clark Boylan proposed opendev/system-config master: DNM intentional Gitea failure to hold a node https://review.opendev.org/c/opendev/system-config/+/848181 | 18:56 |
fungi | i think something may be amiss with nl01, i'm getting random connection resets trying to ssh into it, but sometimes i get through | 18:57 |
clarkb | tkajinam: do you still need your heat-functional held node? | 18:57 |
clarkb | fungi: nl01 pings from here lgtm over ipv4. Could it be ipv6 specific? | 18:58 |
fungi | maybe | 18:58 |
fungi | nope | 18:58 |
frickler | nl01 works fine for me over ipv6 (by default, didn't explicitly choose it) | 18:58 |
fungi | fungi@dhole:~$ ossh -4 nl01 | 18:58 |
fungi | Connection closed by 104.239.144.109 port 22 | 18:58 |
fungi | seems i get through about 50% of the time regardless of whether i'm going via v4 or v6 | 18:59 |
fungi | and nl02 works every time | 18:59 |
frickler | I do see the same when forcing v4 | 19:00 |
fungi | 0% icmp packet loss over both v4 and v6 | 19:00 |
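
(A quick way to quantify the intermittent resets described above; a sketch, with the full hostname assumed to be nl01.opendev.org and nothing more than a no-op run on the remote end.)

```shell
# Rough failure-rate check over IPv4; swap -4 for -6 to compare families.
for i in $(seq 20); do
  ssh -4 -o ConnectTimeout=10 nl01.opendev.org true || echo "attempt $i failed"
done
```
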
fungi | guessing it's something to do with state handling on its hypervisor host | 19:00 |
fungi | probably a noisy neighbor filling up the conntrack tables or something | 19:01 |
clarkb | I just did ~10 ssh connections with no problem | 19:01 |
fungi | weird | 19:01 |
fungi | then i guess it could be further into the network layer | 19:01 |
fungi | maybe their core flow handling has my sessions balanced 50% onto a bad switch and not yours or something | 19:02 |
clarkb | ya it's interesting that it reports the remote is closing the connection | 19:03 |
clarkb | but I guess that could happen due to being unhappy about asymmetrically routed packets | 19:03 |
fungi | it could be just about anything. lots of layer 3/4 middleboxen will spoof tcp/rst packets on behalf of the peer | 19:04 |
fungi | including poorly-behaved rate limiters and packet shapers | 19:04 |
frickler | since clarkb logged in, I'm not getting failures anymore either. problem fixed ;) | 19:05 |
fungi | hah | 19:05 |
fungi | and yeah, it's cleared up for me now too, whatever it was | 19:05 |
fungi | maybe the problem got routed around | 19:06 |
fungi | or a flood somewhere died off | 19:06 |
fungi | splitting here as a tangent from a discussion in the zuul matrix channel, is anyone else surprised that a held jammy node shows an installed version of podman from the original jammy release rather than a newer version available in jammy-security and jammy-updates? | 19:12 |
clarkb | does the job specify the version maybe? | 19:13 |
fungi | oh maybe... dpkg.log shows it being installed during the job at least, so it's not preinstalled | 19:15 |
fungi | yep | 19:17 |
fungi | podman=3.4.4+ds1-1ubuntu1 | 19:17 |
fungi | https://review.opendev.org/c/zuul/zuul-jobs/+/886552 Pin podman on jammy | 19:20 |
fungi | oh, right, that was because of https://bugs.launchpad.net/ubuntu/+source/libpod/+bug/2024394 | 19:20 |
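
(To confirm this on a held jammy node, a generic check would look something like the sketch below; it is not a command from the log.)

```shell
# Shows the installed version vs. the candidates in jammy-updates/-security;
# the installed one stays at 3.4.4+ds1-1ubuntu1 because the zuul-jobs role
# installs podman=3.4.4+ds1-1ubuntu1 explicitly on jammy.
apt-cache policy podman
```
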
cardoe | fungi/clarkb: auth should be fixed | 19:50 |
fungi | thanks cardoe!!! | 19:50 |
fungi | frickler: ^ | 19:51 |
clarkb | cardoe: fungi: confirmed osc server list is working for me now | 19:51 |
cardoe | So I misspoke about the "new accounts do not see projects" thing | 19:51 |
clarkb | cardoe: would logging in through skyline have made a difference? Mostly curious if that thought was relevant | 19:55 |
cardoe | Essentially there's some backend-side "this account is authorized for this product type" thing. When they started Flex it was just using the existing rax openstack value. They then made Flex require a new value, so you had to have that to auth against Flex. They tried to use some heuristics as to who used Flex and should get it. And they missed. | 19:57 |
cardoe | So skyline usage wouldn't have fixed it. | 19:57 |
clarkb | gotcha | 19:59 |
frickler | just strange that it seems to have been working for some time, then started failing like 6 weeks ago | 20:02 |
frickler | ah that's likely the point where that heuristic got applied? | 20:04 |
fungi | interestingly my personal rackspace account which has server instances in both rax classic and flex seems to be working fine | 20:05 |
fungi | so i guess we just got lucky with the control plane account | 20:05 |
fungi | (un)lucky | 20:05 |
frickler | anyway, good that it is working now and I'll recheck the results of the periodic jobs tomorrow | 20:06 |
clarkb | https://200.225.47.39:3081/opendev/system-config is the held gitea 1.23.0 | 20:48 |
clarkb | not sure if we want to try and get that in quickly due to it theoretically fixing the memory leak or wait for a .1. Typically they have a .1 not too long after the .0 release | 20:48 |
clarkb | tonyb: I've gotten to the point of getting ready to boot a Noble paste02 server and I don't see the noble image in rax dfw (or ord and iad). I thought you had converted and uploaded noble images everywhere. Am I misremembering? | 20:54 |
clarkb | Looks like on july 25th we discussed this and https://paste.opendev.org/raw/b8QSrgmqdHkbHY0S6801/ was going to be uploaded | 21:01 |
clarkb | but I'm not seeing those images when running image list or image list --shared | 21:02 |
clarkb | makes me wonder if the upload failed or maybe we didn't think we had to do the dance to upload via glance tasks but we do? or maybe the images were removed? | 21:03 |
jaltman | >***fungi sighs at https://bugs.debian.org/1089501 having been open for weeks now with no confirmation (also i hunted for upstream bug reports and changes in their gerrit, but no dice) | 21:04 |
jaltman | https://www.openafs.org/dl/openafs/1.8.13.1/RELNOTES-1.8.13.1 released on 19-Dec-2024 | 21:04 |
jaltman | Linux clients * Add support for Linux 6.12 (15964, 15965, 15966) | 21:04 |
clarkb | tonyb: I do see the image in other clouds that I've spot checked. Hoping you can remember what happened there. | 21:05 |
fungi | jaltman: thanks! i hadn't spotted the .1 release | 21:06 |
fungi | i'll prod the debian package maintainer | 21:06 |
clarkb | I see the vhd image file in tonyb's homedir on bridge so it did get converted from raw at least | 21:06 |
clarkb | I'm suddenly worried we're going to go through quite the experience just to upload an image to rax. But we know it is doable because nodepool does it | 21:11 |
clarkb | cloudnull not sure if this is on your radar yet but rax classic doesn't have Ubuntu Noble images yet (I thought we had uploaded our own, but it doesn't look like we did, hence my discovery that the cloud doesn't provide any yet either) | 21:15 |
cardoe | frickler: yeah 6 weeks sounds about right | 21:18 |
opendevreview | Merged zuul/zuul-jobs master: Add ability to exclude a specific platform https://review.opendev.org/c/zuul/zuul-jobs/+/926164 | 22:55 |
opendevreview | Merged zuul/zuul-jobs master: ensure-podman: add tasks to configure socket group https://review.opendev.org/c/zuul/zuul-jobs/+/925916 | 22:55 |
clarkb | I'm not sure if tonyb is around today or not but I think he is on vacation starting tomorrow? I guess I'll give it until then for feedback on the image, otherwise I'll try to convert a current image to vhd and then proceed to attempting to upload it to the three rax classic regions | 22:56 |
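
(A rough sketch of the conversion and upload steps being considered; the filenames and the final image name are placeholders, the cloud/region names come from earlier in the discussion, and whether the plain upload works or the glance task-based import is needed is still the open question noted above.)

```shell
# Convert raw -> dynamic VHD using qemu-img's vpc driver; force_size keeps
# the exact byte size rather than rounding to a cylinder boundary.
qemu-img convert -f raw -O vpc -o subformat=dynamic,force_size=on \
    noble-server.raw noble-server.vhd

# Simple upload attempt against one rax classic region.
openstack --os-cloud openstackci-rax --os-region-name DFW image create \
    --disk-format vhd --container-format bare \
    --file noble-server.vhd ubuntu-noble
```
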
clarkb | I've started brainstorming a rough outline of topics for a january opendev meetup in https://etherpad.opendev.org/p/opendev-january-2025-meetup | 23:41 |
clarkb | right now there isn't really an order, I'm just braindumping topics. Feel free to add your own and then as we get closer to a schedule/plan I'll try to organize things | 23:42 |