opendevreview | tianyutong proposed openstack/project-config master: Allow tag creation for heterogeneous-distributed-training-framework https://review.opendev.org/c/openstack/project-config/+/953069 | 01:41 |
mnasiadka | morning | 06:46 |
mnasiadka | I see the devstack based testing for DIB is moving forward, nice - tonyb if you need any help - just shout :) | 06:47 |
mnasiadka | clarkb: I'm afraid there are no updates for the boot partition GUID in stream 10 - should I just work around it in the centos element, or leave a reno that it's broken for now? | 06:49 |
*** liuxie is now known as liushy | 07:24 | |
*** dhill is now known as Guest18614 | 11:07 | |
opendevreview | Merged openstack/project-config master: Allow tag creation for heterogeneous-distributed-training-framework https://review.opendev.org/c/openstack/project-config/+/953069 | 12:31 |
clarkb | mnasiadka: we can probably work around it if there is no movement. I think the approach I would take is if we find no explicit / but there is a different bootable partition then we assume that is / | 15:01 |
clarkb | but I'd have to look at the dib code again to be sure that is the right approach. there are several partition types it is trying to track and maybe some other approach is more accurate | 15:05 |
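A rough illustration of the fallback being discussed, not DIB's actual code: look for a partition carrying the discoverable-partitions root type GUID, and if none is found fall back to whichever partition is flagged bootable and treat it as /. The loop device path and the x86-64 root GUID are assumptions for the example.

```bash
# Hedged sketch of the fallback heuristic, not DIB's implementation.
dev=/dev/loop0   # hypothetical loop device for the image under inspection
# x86-64 root partition type GUID from the discoverable partitions spec (assumed arch)
root_guid=4f68bce3-e8cd-4db1-96e7-fbcaf984b709
root=$(lsblk -nro NAME,PARTTYPE "$dev" | awk -v g="$root_guid" '$2==g {print $1; exit}')
if [ -z "$root" ]; then
    # no explicit / partition found; assume the (first) flagged/bootable partition is /
    # (simplified: any partition with flags set, e.g. boot or legacy_boot)
    root=$(lsblk -nro NAME,PARTFLAGS "$dev" | awk '$2 != "" {print $1; exit}')
fi
echo "treating /dev/${root:-unknown} as /"
```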
clarkb | the ord backup server needs pruning | 15:06 |
clarkb | I'm going to do weekly local updates then after that the main things on my todo list for today are the gitea 1.24 upgrade if we're comfortable with that and booting some new noble servers for mirror update and zookeeper. corvus I had a thought this morning that I could push a zuul executor on noble test change and if that passes boot a new noble ze canary node too. If I do that what | 15:07 |
clarkb | digits would you like me to give that server (ze01 and replace existing servers or ze13 and start appending to the list?) | 15:07 |
fungi | worth noting the replacements booted last week just reused the same names as the originals. given that, cycling back to ze01 should also work fine | 15:13 |
opendevreview | Clark Boylan proposed opendev/system-config master: Test zuul executor deployment on noble https://review.opendev.org/c/opendev/system-config/+/953127 | 15:19 |
clarkb | fungi: re gitea 1.24 did you get a chance to check the held node? Looks like you +2'd the change. Just wondering if we should proceed now or if there is more testing to be done. I haven't checked the 1.24.2 held node yet but will do so momentarily | 15:20 |
clarkb | quickly poking around, the web ui looks like 1.24.1's did. I don't see anything immediately concerning. I also ran a clone that worked too and the resulting git repo locally has the content I expect so ya this looks fine to me if we think we're ready to proceed | 15:23 |
clarkb | corvus: alternatively, maybe you'd prefer we replace schedulers first? | 15:25 |
fungi | clarkb: yeah, held node looked fine to me | 15:31 |
Clark[m] | I'm grabbing something for breakfast really quickly but afterwards I guess we approve the gitea upgrade if there is no additional feedback? | 15:32 |
fungi | yeah, i'm on hand whenever, just in conference calls for another half an hour | 15:32 |
mnasiadka | clarkb: I’ll update the patch tomorrow and rebase it on tonyb’s chain | 15:52 |
clarkb | dan_with: two non urgent observations from our side on the raxflex cloud. First is that our nodepool service is leaking floating ips. This is a long standing openstack issue and not something specific to your deployment. It is expected and we even have cleanup routines to clear them out. Unfortunately, those are currently disabled because we have two competing tools running at the | 16:00 |
clarkb | moment that don't coordinate and they would delete each other's floating ips if we enabled the functionality. But it has me wondering if there is any interest in trying to find the reason for why the openstack apis say the fip is deleted when we attempt to delete them the first time and yet they stick around. The other thing is I'm manually deleting leaked floating ips that I've | 16:00 |
clarkb | checked won't break anything and the api response times from sjc3 are pretty slow. dfw3 was really quick in comparison. Not sure if that is a known issue on your end but that may be worth checking too? | 16:00 |
clarkb | jamesdenton__: ^ fyi | 16:00 |
jamesdenton__ | reading | 16:00 |
jamesdenton__ | thanks for the feedback, will check that out | 16:02 |
clarkb | jamesdenton__: our node deletion process is to delete the fip and wait for openstack to say it is deleted (I can't remember if this is a nova or neutron api call but we could look it up). Then afterwards we delete the node. In the case of these leaks we believe that fip delete is "succeeding" such that we proceed with deleting the node but the fip hasn't actually gone away. This is | 16:04 |
clarkb | behavior we have observed in openstack for a long long time so nothing new on your end. More just wondering if there is interest in trying to debug further since yall are using fips in your cloud setup | 16:04 |
jamesdenton__ | yes, most definitely | 16:04 |
jamesdenton__ | also curious what your delete strategy is so we might try and replicate | 16:05 |
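Not the nodepool code itself, but a hedged CLI approximation of the sequence clarkb describes, which may help with replication outside of nodepool; the IDs are passed in as arguments and all command names are standard OpenStack client calls.

```bash
#!/bin/bash
# Hedged reproduction sketch of the described flow: delete the FIP, wait until
# the API reports it DOWN or it disappears, then delete the server.
FIP_ID=${1:?usage: $0 <fip-id> <server-id>}
SERVER_ID=${2:?usage: $0 <fip-id> <server-id>}

openstack floating ip delete "$FIP_ID"
while :; do
    status=$(openstack floating ip show "$FIP_ID" -f value -c status 2>/dev/null)
    [ -z "$status" ] || [ "$status" = "DOWN" ] && break   # gone or DOWN: proceed
    sleep 1
done
openstack server delete "$SERVER_ID"

# A leaked FIP is one that later lingers (or reappears) in:
openstack floating ip list
```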
corvus | clarkb: i pushed https://review.opendev.org/952697 and https://review.opendev.org/952698 a bit ago... i'm happy to continue the noble upgrade for zuul | 16:06 |
clarkb | corvus: oh perfect. I can abandon my change. And thanks! | 16:07 |
clarkb | jamesdenton__: https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openstack/adapter.py#L166-L196 this is the state machine code that we use | 16:07 |
corvus | np! i think i might do schedulers this week and then executors next weekend; try to do them all at once so the process doesn't drag out | 16:08 |
clarkb | looking at the code there I suppose it is possible that adapter._hasFloatingIps() is returning false and we're relying on nova to automatically clean up the fips rather than trying to do it ourselves? | 16:09 |
clarkb | corvus: ^ you probably have a better sense of how likely that is. We could add some simple logging maybe to double check | 16:10 |
clarkb | hrm looks like that is asking openstacksdk for any answer | 16:10 |
clarkb | ok I think I've got fip cleanup well underway. I'm going to approve the change to upgrade gitea. The testing for that should take about an hour if there are any objections or concerns we can unapprove/wip the change | 16:15 |
clarkb | infra-root (and others) I've been distracted the last couple of weeks, but do plan to put a meeting agenda together today and chair the meeting tomorrow. Let me know if there are updates that should be made to the agenda. It is very likely I have missed things | 16:18 |
fungi | clarkb: i think there are probably a few older agenda items that could come off | 16:22 |
clarkb | ack I've made a note to clean things up too | 16:24 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Add support for CentOS Stream 10 https://review.opendev.org/c/openstack/diskimage-builder/+/934045 | 16:49 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Add support for CentOS Stream 10 https://review.opendev.org/c/openstack/diskimage-builder/+/934045 | 16:49 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Add support for CentOS Stream 10 https://review.opendev.org/c/openstack/diskimage-builder/+/934045 | 16:50 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Add support for CentOS Stream 10 https://review.opendev.org/c/openstack/diskimage-builder/+/934045 | 16:51 |
corvus | clarkb: with rax flex, we call delete_floating_ip | 16:56 |
corvus | 2025-06-23 16:55:17,914 DEBUG nodepool.OpenStackAdapter.raxflex-dfw3: API call delete_floating_ip in 0.3182524172589183 | 16:56 |
corvus | so that means the full sequence you highlighted in https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openstack/adapter.py#L166-L196 should be run | 16:57 |
clarkb | corvus: cool thank you for confirming | 17:02 |
clarkb | so its almost like we delete the fip, it goes away so we proceed then sometime later the cloud remembers that the fip is still there? Zombie floating ips | 17:02 |
corvus | yeah; i don't think we can tell from the logs whether the fip was reported "DOWN" or whether it disappeared from the list | 17:02 |
corvus | we consider either of those sufficient to proceed | 17:03 |
corvus | i don't know whether we have seen both behaviors, or whether we only ever see "DOWN" and the other is just a safety valve | 17:03 |
corvus | but maybe we see both? and maybe that could be an important detail? i dunno. | 17:04 |
corvus | oh, sorry, to add context in case anyone isn't aware: the way that zuul/nodepool update states on things like servers or floating ips, is that they list the objects. so we do one api call to update everything at once. | 17:05 |
corvus | that's what i mean when i say "disappeared from the list" i mean the results of "list_floating_ips" | 17:06 |
corvus | so the condition for us to proceed is either 1) the status of the fip in question is reported as "DOWN" in list_floating_ips; or 2) the fip in question no longer appears in list_floating_ips | 17:07 |
clarkb | and ya it seems likely that only one of those two conditions leads to leaking the fip later (possibly the down state rather than full removal?) | 17:08 |
clarkb | since it would be weird for something to be completely delisted (deleted) then show up again later | 17:08 |
frickler | do we detach the fip from the instance before we delete either? | 17:09 |
clarkb | no I don't think so | 17:10 |
clarkb | the first step in that state machine loop is to send the delete request | 17:10 |
clarkb | https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openstack/adapter.py#L173 there | 17:10 |
frickler | ok, that might be an important detail in trying to reproduce this. I also don't exactly understand why we consider fip_status=down as deleted | 17:15 |
fungi | good point, downed fips probably still count against the fip count/quota | 17:16 |
clarkb | I suspect because when you delete an fip one of the steps before final removal is marking it down? So this allows us to proceed earlier (but maybe that leads to the leak?) | 17:16 |
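For the manual cleanup mentioned above, one hedged way to surface candidates is to list FIPs reported DOWN and look for ones with no attached port; whether that exactly matches the leaked set is an assumption.

```bash
# Hedged sketch: floating IPs reported DOWN with an empty Port column are not
# attached to anything and are candidates for leaked FIPs.
openstack floating ip list --status DOWN -f value -c ID -c "Floating IP Address" -c Port
# After double checking a candidate won't break anything, remove it with:
# openstack floating ip delete <fip-id>
```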
clarkb | I've started booting mirror-update03 while waiting on the gitea upgrade to gate | 17:18 |
clarkb | once dns is updated and the system-config change is ready for that we can start to quiesce mirror-update02 to avoid conflicts and land the 03 system-config change | 17:19 |
clarkb | and then late this week I'd like to do the gerrit restart stuff I've been punting on due to other distractions | 17:20 |
clarkb | (I'm saying that out loud here now as that should help me actually do those tasks) | 17:21 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Add mirror-update03 to DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/953139 | 17:28 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add mirror-update03 to the inventory https://review.opendev.org/c/opendev/system-config/+/951163 | 17:35 |
clarkb | I realized that we currently have separate host vars files for mirror-update01 and mirror-update02. I don't think there is any reason we need to do that so ^ updates docs to use a group vars file instead and I'll update private vars before 951163 lands | 17:36 |
fungi | yeah, i think when 02 was created the opportunity to switch to group vars was missed | 17:38 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Point mirror-update to mirror-update03 https://review.opendev.org/c/opendev/zone-opendev.org/+/953140 | 17:38 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Remove mirror-update02 from DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/953141 | 17:38 |
clarkb | ok I think all of that prep is done/ready. The only thing remaining is grabbing locks and/or shutting down mirror-update02 and code reviews. I'll wait for gitea before I start looking at that though as gitea should be soon | 17:41 |
opendevreview | Merged opendev/zone-opendev.org master: Add mirror-update03 to DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/953139 | 17:42 |
fungi | zuul thinks the gitea upgrade is 4 minutes from merging, so should put the deploy before the hourly jobs | 17:45 |
fungi | it's finally wrapping up, should still land in time (if only barely) | 17:56 |
opendevreview | Merged opendev/system-config master: Update to Gitea 1.24.2 https://review.opendev.org/c/opendev/system-config/+/948560 | 17:57 |
clarkb | here we go | 17:58 |
clarkb | https://gitea09.opendev.org:3081/opendev/system-config/ is running 1.24.2 now | 18:02 |
clarkb | the restart seems a bit slow I suspect there are db migrations (and by that I mean like 40 seconds instead of the usual 5 ish) | 18:02 |
clarkb | not a big deal | 18:02 |
clarkb | git clone from that url works for me too | 18:03 |
fungi | mmm, yeah seems fine to me after | 18:05 |
fungi | from a ui perspective, seems it added a left sidebar for the file tree when looking at file contents | 18:08 |
fungi | which wasn't there in 1.23.8 | 18:09 |
clarkb | yup and there are fancier icons for files and directories it recognizes too | 18:09 |
clarkb | 09-12 are done now. | 18:09 |
fungi | sadly zuul's custom sphinx directives still cause challenges for the rst rendering | 18:10 |
fungi | https://gitea09.opendev.org:3081/zuul/zuul-jobs/src/branch/master/roles/ensure-nox/README.rst | 18:10 |
fungi | though i think it's just that it doesn't know that zuul:rolevar content bodies are printable | 18:12 |
clarkb | ya I think the renderer hits an error and just stops proceeding with the render? | 18:12 |
fungi | not exactly, because in that example it skipped the first zuul:rolevar content and printed the heading that came after it too | 18:14 |
fungi | it's done deploying on gitea14 now | 18:14 |
clarkb | 13 and 14 are done now. The job should complete soon | 18:14 |
clarkb | ah | 18:14 |
clarkb | https://zuul.opendev.org/t/openstack/buildset/b286c574dd2d4aba8f254830f30a3ffc success | 18:15 |
* fungi cheers | 18:15 | |
clarkb | the last major thing to check is probably gerrit replication | 18:15 |
fungi | we shoulda saved the mirror-update03 dns addition for this | 18:16 |
clarkb | I think https://gitea09.opendev.org:3081/starlingx/nfv/commit/43479af7b5e851f8e18d26cbaf865e7e1d0ff758 was pushed to gerrit after gitea09 updated | 18:17 |
clarkb | https://review.opendev.org/c/starlingx/nfv/+/953128 its this change | 18:17 |
clarkb | ps6 was uploaded at 18:09 according to gerrit timestamps and 18:02 ish is when I noted that gitea09 was upgraded so I think we're probably good with replication too | 18:17 |
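Beyond comparing timestamps, a hedged spot-check for replication is to compare branch tips between gerrit and the upgraded backend directly; matching SHAs mean replication has kept up (the repository here is just the one from this example).

```bash
# Compare the master tip as gerrit serves it with what gitea09 is serving.
git ls-remote https://review.opendev.org/starlingx/nfv refs/heads/master
git ls-remote https://gitea09.opendev.org:3081/starlingx/nfv refs/heads/master
```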
fungi | yeah, that checks out | 18:18 |
clarkb | any concerns with me starting to grab mirror-update02 locks in a screen on that server for eventual landing of the mirror-update03 change later today? | 18:19 |
clarkb | I'm thinking it may take a bit to grab all the locks and it is almost lunch time here and I'm not sure if I have reviews yet so not sure if now is a good time to start | 18:19 |
clarkb | I don't think there are any important mirror updates that we are waiting on either though which is probably the biggest thing | 18:19 |
clarkb | looking at crontab -l there are a couple of cron jobs that don't flock as well that we need to turn off (this will require putting 02 in the emergency file) | 18:21 |
clarkb | but I think I start with all the flocks. Then when I've got them all I can worry about that last piece of collision avoidance | 18:22 |
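The lock grabbing works because the mirror scripts wrap their work in flock on per-mirror lock files, so holding the same lock from a screen window blocks new runs. A minimal sketch of that pattern follows; the lock path and script name are placeholders, not the real paths on the server.

```bash
# Hedged sketch: hold a mirror lock from a screen window until released.
LOCK=/var/run/mirror-update/debian.lock   # placeholder path
flock "$LOCK" bash -c 'echo "lock held"; sleep infinity'

# The mirror job side uses the same lock and simply skips if it is busy, e.g.:
# flock -n "$LOCK" /usr/local/bin/some-mirror-script || echo "lock busy, skipping"
```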
clarkb | https://gitea09.opendev.org:3081/starlingx/nfv/commit/ffa556abd7475cc2eab58ae3e15de94a296f4274 this one was after all 6 giteas updated (same change new patchset) | 18:22 |
clarkb | https://opendev.org/starlingx/nfv/commit/ffa556abd7475cc2eab58ae3e15de94a296f4274 a better link | 18:23 |
fungi | i wonder if it would make more sense to shut down the server? | 18:26 |
clarkb | fungi: yes I think so. But to do that safely I think we have to lock everything first | 18:27 |
clarkb | but then maybe I don't have to edit the crontab | 18:27 |
clarkb | I have 13 / 16 flocks held in a root screen on 02 | 18:27 |
clarkb | the other three are debian, ubuntu, and ubuntu-ports and they are running | 18:27 |
fungi | or just shut down the cron service and wait for everything to finish | 18:27 |
clarkb | oh ya that should work too | 18:28 |
fungi | holding locks makes sense when it's one or a handful of things we're trying to block, but when it's everything there are probably less labor-intensive solutions | 18:28 |
clarkb | ya though I wonder if cronjobs that fire and have no crond to report back to end up zombifying? I can grab the flocks to prevent new jobs then when that is done we can stop whatever the cron service is on ubuntu | 18:29 |
fungi | should probably also double-check first that everything is actually healthy, rather than trying to debug a problem after the server replacement that turns out to have been present since before we started | 18:29 |
clarkb | fungi: in afs you mean? That is a good idea | 18:29 |
fungi | yeah, i'm looking now | 18:29 |
fungi | according to grafana most of the mirrors finished around 15-16 hours ago | 18:31 |
fungi | i kinda thought we were updating them more frequently than that | 18:31 |
fungi | oh, never mind | 18:32 |
fungi | i was mixing up the wheel updates with the distro package mirrors | 18:32 |
clarkb | ah | 18:32 |
clarkb | debian just finished so I grabbed its lock | 18:32 |
fungi | everything that's configured to update has done so within the past 6 hours, yeah | 18:32 |
fungi | so looks like we should be fine | 18:33 |
fungi | a few of the mirrors are skating close to quota, but nothing looks completely full | 18:34 |
clarkb | the three non flock'd jobs are afsmon (reporting the stats grafana renders), docs/tarballs/etc release, and the publication of mirror-update logs to afs | 18:40 |
clarkb | for that last item we're rsyncing using -rltvz from the server /var/log/$location to related afs dirs | 18:41 |
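A hedged sketch of that log publication step, with a placeholder destination since the exact AFS path isn't in the log:

```bash
# Publish the server's logs into AFS with the flags quoted above:
# -r recursive, -l copy symlinks as symlinks, -t preserve mtimes,
# -v verbose, -z compress in transit. "reprepro" stands in for $location.
rsync -rltvz /var/log/reprepro/ /afs/openstack.org/path/to/mirror-logs/reprepro/
```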
clarkb | I don't think we need to spend a bunch of energy trying to coalesce old and new logs there. I think it's fine to just stop 02 from writing and let 03 start writing? | 18:42 |
fungi | yeah, i'm okay with that | 18:42 |
fungi | we'll lose something like a few days of log history if we do nothing | 18:42 |
fungi | worst case we can keep the old server for a few days in case anyone needs "old" logs | 18:43 |
clarkb | ya | 18:43 |
clarkb | I'm going to find lunch while I'm waiting for ubuntu and ubuntu-ports to finish | 18:51 |
clarkb | ok I have all 16 flocks held in the root screen | 19:23 |
clarkb | still finishing up lunch though. Not sure if you want to proceed with system shutdown / crond stoppage | 19:23 |
clarkb | ok I'm ready now I think | 20:22 |
clarkb | mirror-update02 is in the emergency file now | 20:24 |
clarkb | fungi: do you think `systemctl stop cron.service && systemctl disable cron.service` is sufficient on mirror-update02 for us to proceed with adding 03 to the inventory? | 20:25 |
fungi | yes | 20:26 |
fungi | we can also check for the crond process in ps | 20:26 |
fungi | but that should suffice to keep it from restarting if the server gets accidentally rebooted | 20:27 |
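Alongside the stop/disable command quoted above, a couple of follow-up checks along the lines fungi suggests (standard systemctl/pgrep usage, not taken from the log):

```bash
systemctl stop cron.service && systemctl disable cron.service
systemctl is-enabled cron.service   # expect "disabled"
systemctl is-active cron.service    # expect "inactive"
pgrep -af cron                      # confirm no crond or straggler jobs remain
```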
clarkb | done | 20:29 |
clarkb | thats in window 0 of my root screen on 02. Do you want to approve the 03 inventory addition change when you're happy with 02's state? | 20:30 |
fungi | attached | 20:31 |
fungi | looks clear to me | 20:31 |
clarkb | cool do you want to +A or should I? | 20:32 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/951163 this change | 20:32 |
fungi | i can | 20:32 |
clarkb | thanks! | 20:36 |
clarkb | I've done a quick first pass on the meeting agenda. Cleaned up some old content and updated content I have enough context for (dib centos 10 and replacing old servers in particular) | 20:47 |
opendevreview | Merged opendev/system-config master: Add mirror-update03 to the inventory https://review.opendev.org/c/opendev/system-config/+/951163 | 21:08 |
clarkb | the hourly jobs are just finishing up then ^ should deploy | 21:08 |
clarkb | the mirror-update job is running now. I expect it to be slower than usual as it dkms builds openafs | 21:18 |
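If the deploy does sit on the openafs build, a hedged way to check progress on the new host (the dkms module name is assumed to be openafs):

```bash
dkms status | grep -i openafs   # shows built/installed state per kernel
modinfo openafs | head -n 3     # confirms the module is visible to the kernel
```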
clarkb | infra-prod-service-afs failed and I think it is because the setup of the vos release ssh key between mirror-update and afs servers means mirror-update needs to run before the afs playbook | 21:21 |
clarkb | I'm going to dig into that a bit more now and if that is the fix we should be able to push a change that sets up that relationship, redeploys things, and makes them happy | 21:22 |
fungi | yeah, there's no rush on this, if anyone notices docs pages not updating or whatever we can explain | 21:27 |
clarkb | ok the issue is that mirror-update02 is in the emergency file so its ssh keypair wasn't registered and we had an assertion fail | 21:28 |
clarkb | I guess we can remove it from the emergency file and let things run, remove it from the inventory and let things run, or temporarily remove that assertion? | 21:29 |
clarkb | I think that it should be safe to remove it from the emergency file unless we ensure that cron is an enabled running service somewhere | 21:29 |
clarkb | alternatively maybe we just remove it from the inventory but don't delete the server yet | 21:30 |
clarkb | actually that may be the best option. We can then add it back to the inventory if we have to rollback. But removing it from the inventory will cause the jobs to rerun too and apply that ssh key which is nice | 21:30 |
fungi | seems safe to remove it completely, yes | 21:31 |
opendevreview | Clark Boylan proposed opendev/system-config master: Remove mirror-update02 from our ansible inventory https://review.opendev.org/c/opendev/system-config/+/953153 | 21:33 |
clarkb | note mirror-update03 is going to start running cronjobs nowish but they will all fail due to ssh not working | 21:34 |
clarkb | that should be ok | 21:34 |
clarkb | they will just do work then vos release will have an ssh error until we get 953153 (or something like it) landed, or manually set up authorized keys on the afs server(s) | 21:34 |
fungi | this might also be the explanation for why they weren't using group vars originally | 21:34 |
clarkb | fungi: maybe, though the ansible configuring the keys is using the group which is why we're hitting this | 21:35 |
clarkb | its basically doing a loop over every mirror-update group member and attempting to ensure the ssh key on those servers is in authorized keys on afs servers | 21:35 |
clarkb | but since mirror-update02 is in the emergency file we never loaded the key material to pass to the other servers | 21:35 |
clarkb | in any case I think 953153 will address the issue since that removes mirror-update02 from the group by not having it in our inventory then we won't try to loop over it anymore | 21:36 |
clarkb | ya if you look at /var/log/afs-release/afs-release.log on mirror-update03 it reports failures to release due to ssh | 21:42 |
clarkb | this is expected given my understanding of the situation. I'm just calling it out that things are failing as expected and it doesn't seem to be a problem. We should be able to confirm via that log file that things are working once the fix change deploys | 21:43 |
clarkb | it runs every 5 minutes so is a good canary | 21:43 |
clarkb | do we want to enqueue 953153 directly to the gate to speed it up? Or just let it roll as is | 21:43 |
fungi | i'm about to start cooking dinner, but good with either | 21:44 |
clarkb | I'll enqueue it since I'm impatient | 21:45 |
fungi | sgtm, thanks! | 21:45 |
opendevreview | Merged opendev/system-config master: Remove mirror-update02 from our ansible inventory https://review.opendev.org/c/opendev/system-config/+/953153 | 21:54 |
clarkb | the afs deployment job is running for ^ now which should fix things for us | 22:05 |
clarkb | the job finished successfully. At 22:10 the cronjob to release docs/tarballs/etc will run and should confirm things are happy now | 22:07 |
clarkb | hrm it still failed. I suspect the problem now is known_hosts | 22:10 |
clarkb | yes I think that was it based on a manual run of ssh. I went ahead and manually added the known_host entry and we'll see if the 22:15 run is happy | 22:12 |
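One hedged way to do that manual known_hosts addition, with a placeholder hostname for the AFS server the release script ssh'es to:

```bash
# Record the host key, then verify ssh works non-interactively.
ssh-keyscan -H afs01.example.opendev.org >> /root/.ssh/known_hosts
ssh -o BatchMode=yes afs01.example.opendev.org true && echo "ssh ok"
```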
clarkb | it is running vos release docs right now | 22:15 |
clarkb | 2025-06-23 22:16:19,823 release DEBUG Released volume docs successfully | 22:17 |
clarkb | I think we are good now | 22:17 |
clarkb | ubuntu and ubuntu-ports are running updates too which should give us good indication from the larger volumes as well | 22:17 |
clarkb | and cloud archive just started | 22:18 |
clarkb | Released volume mirror.ubuntu-cloud successfully | 22:20 |
clarkb | once others are happy with the new server (I haven't checked any rsync mirroring results but reprepro and vos release look good on initial inspection) the next step is landing https://review.opendev.org/c/opendev/zone-opendev.org/+/953140 then deciding if/how/when to delete mirror-update02. I think with cron.service disabled we're ok to leave it running for a bit | 22:21 |
clarkb | https://grafana.opendev.org/d/9871b26303/afs?orgId=1&from=now-6h&to=now&timezone=utc I think the data updates on here show that afsmon is working too? Also you can see after I disabled and stopped cron.service on 02 that we stopped getting vos release timer info until the new server booted | 22:23 |
fungi | awesome. thanks for working it through! | 22:24 |
clarkb | all of the rsync mirroring scripts are on */6 hourly schedules | 22:27 |
clarkb | which means they will start during the midnight UTC hour (they have varied minute rules so are spread out through that hour) | 22:27 |
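Illustrative only (entries, paths, and minutes are invented): the schedule shape described is every six hours with staggered minutes, so the first full round lands during the midnight UTC hour.

```bash
# crontab -l style illustration of the */6 hourly pattern with varied minutes.
# m   h    dom mon dow  command
# 13  */6  *   *   *    flock -n /var/run/mirror-update/debian.lock /usr/local/bin/mirror-debian
# 27  */6  *   *   *    flock -n /var/run/mirror-update/ubuntu.lock /usr/local/bin/mirror-ubuntu
```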
clarkb | ok last call on the meeting agenda. I think I've got all the edits in that I am aware of. Let me know if anything is missing and I can add it. Otherwise I'll get that sent out in 15-20 minutes | 22:45 |
clarkb | and sent | 23:06 |
clarkb | centos stream mirror should be first starting at 0006 | 23:34 |
clarkb | I'll check that is happy before I sign off for the day | 23:34 |