Monday, 2025-06-23

opendevreviewtianyutong proposed openstack/project-config master: Allow tag creation for heterogeneous-distributed-training-framework  https://review.opendev.org/c/openstack/project-config/+/95306901:41
mnasiadkamorning06:46
mnasiadkaI see the devstack based testing for DIB is moving forward, nice - tonyb if you need any help - just shout :)06:47
mnasiadkaclarkb: I'm afraid there are no updates for the boot partition GUID in stream 10 - should I just work it around for centos element, or leave a reno that it's broken for now?06:49
*** liuxie is now known as liushy07:24
*** dhill is now known as Guest1861411:07
opendevreviewMerged openstack/project-config master: Allow tag creation for heterogeneous-distributed-training-framework  https://review.opendev.org/c/openstack/project-config/+/95306912:31
clarkbmnasiadka: we can probably work around it if there is no movement. I think the approach I would take is if we find no explicit / but there is a different bootable partition then we assume that is /15:01
clarkbbut I'd have to look at the dib code again to be sure that is the right approach. there are several partition types it is trying to track and maybe some other approach is more accurate15:05
clarkbthe ord backup server needs pruning15:06
clarkbI'm going to do weekly local updates then after that the main things on my todo list for today are the gitea 1.24 upgrade if we're comfortable with that and booting some new noble servers for mirror update and zookeeper. corvus I had a thought this morning that I could push a zuul executor on noble test change and if that passes boot a new noble ze canary node too. If I do that what15:07
clarkbdigits would you like me to give that server (ze01 and replace existing servers or ze13 and start appending to the list?)15:07
fungiworth noting the replacements booted last week just reused the same names as the originals. given that, cycling back to ze01 should also work fine15:13
opendevreviewClark Boylan proposed opendev/system-config master: Test zuul executor deployment on noble  https://review.opendev.org/c/opendev/system-config/+/95312715:19
clarkbfungi: re gitea 1.24 did you get a chance to check the held node? Looks like you +2'd the change. Just wondering if we should proceed now or if there is more testing to be done. I haven't checked the 1.24.2 held node yet but will do so momentarily15:20
clarkbquickly poking around, the web ui looks like 1.24.1's did. I don't see anything immediately concerning. I also ran a clone that worked too and the resulting git repo locally has the content I expect, so ya this looks fine to me if we think we're ready to proceed15:23
clarkbcorvus: alternatively, maybe you'd prefer we replace schedulers first?15:25
fungiclarkb: yeah, held node looked fine to me15:31
Clark[m]I'm grabbing something for breakfast really quickly but afterwards I guess we approve the gitea upgrade if there is no additional feedback?15:32
fungiyeah, i'm on hand whenever, just in conference calls for another half an hour15:32
mnasiadkaclarkb: I’ll update the patch tomorrow and rebase it on tonyb’s chain15:52
clarkbdan_with: two non urgent observations from our side on the raxflex cloud. First is that our nodepool service is leaking floating ips. This is a long standing openstack issue and not something specific to your deployment. It is expected and we even have cleanup routines to clear them out. Unfortunately, those are currently disabled because we have two competing tools running at the16:00
clarkbmoment that don't coordinate and they would delete each other's floating ips if we enabled the functionality. But it has me wondering if there is any interest in trying to find the reason for why the openstack apis say the fip is deleted when we attempt to delete them the first time and yet they stick around. The other thing is I'm manually deleting leaked floating ips that I've16:00
clarkbchecked won't break anything and the api response times from sjc3 are pretty slow. dfw3 was really quick in comparison. Not sure if that is a known issue on your end but that may be worth checking too?16:00
clarkbjamesdenton__: ^ fyi16:00
jamesdenton__reading16:00
jamesdenton__thanks for the feedback, will check that out16:02
clarkbjamesdenton__: our node deletion process is to delete the fip and wait for openstack to say it is deleted (I can't remember if this is a nova or neutron api call but we could look it up). Then afterwards we delete the node. In the case of these leaks we believe that fip delete is "succeeding" such that we proceed with deleting the node but the fip hasn't actually gone away. This is16:04
clarkbbehavior we have observed in openstack for a long long time so nothing new on your end. More just wondering if there is interest in trying to debug further since yall are using fips in your cloud setup16:04
jamesdenton__yes, most definitely16:04
jamesdenton__also curious what your delete strategy is so we might try and replicate16:05
corvusclarkb: i pushed https://review.opendev.org/952697 and https://review.opendev.org/952698 a bit ago... i'm happy to continue the noble upgrade for zuul16:06
clarkbcorvus: oh perfect. I can abandon my change. And thanks!16:07
clarkbjamesdenton__: https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openstack/adapter.py#L166-L196 this is the state machine code that we use16:07
corvusnp!  i think i might do schedulers this week and then executors next weekend; try to do them all at once so the process doesn't drag out16:08
clarkblooking at the code there I suppose it is possible that adapter._hasFloatingIps() is returning false and we're relying on nova to automatically clean up the fips rather than trying to do it ourselves?16:09
clarkbcorvus: ^ you probably have a better sense of how likely that is. We could add some simple logging maybe to double check16:10
clarkbhrm looks like that is asking openstacksdk for any answer16:10
clarkbok I think I've got fip cleanup well underway. I'm going to approve the change to upgrade gitea. The testing for that should take about an hour if there are any objections or concerns we can unapprove/wip the change16:15
clarkbinfra-root (and others) I've been distracted the last couple of weeks, but do plan to put a meeting agenda together today and chair the meeting tomorrow. Let me know if there are updates that should be made to the agenda. It is very likely I have missed things16:18
fungiclarkb: i think there are probably a few older agenda items that could come off16:22
clarkback I've made a note to clean things up too16:24
opendevreviewMichal Nasiadka proposed openstack/diskimage-builder master: Add support for CentOS Stream 10  https://review.opendev.org/c/openstack/diskimage-builder/+/93404516:49
opendevreviewMichal Nasiadka proposed openstack/diskimage-builder master: Add support for CentOS Stream 10  https://review.opendev.org/c/openstack/diskimage-builder/+/93404516:49
opendevreviewMichal Nasiadka proposed openstack/diskimage-builder master: Add support for CentOS Stream 10  https://review.opendev.org/c/openstack/diskimage-builder/+/93404516:50
opendevreviewMichal Nasiadka proposed openstack/diskimage-builder master: Add support for CentOS Stream 10  https://review.opendev.org/c/openstack/diskimage-builder/+/93404516:51
corvusclarkb: with rax flex, we call delete_floating_ip16:56
corvus2025-06-23 16:55:17,914 DEBUG nodepool.OpenStackAdapter.raxflex-dfw3: API call delete_floating_ip in 0.318252417258918316:56
corvusso that means the full sequence you highlighted in https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openstack/adapter.py#L166-L196 should be run16:57
clarkbcorvus: cool thank you for confirming17:02
clarkbso it's almost like we delete the fip, it goes away so we proceed, then sometime later the cloud remembers that the fip is still there? Zombie floating ips17:02
corvusyeah; i don't think we can tell from the logs whether the fip was reported "DOWN" or whether it disappeared from the list17:02
corvuswe consider either of those sufficient to proceed17:03
corvusi don't know whether we have seen both behaviors, or whether we only ever see "DOWN" and the other is just a safety valve17:03
corvusbut maybe we see both?  and maybe that could be an important detail?  i dunno.17:04
corvusoh, sorry, to add context in case anyone isn't aware: the way that zuul/nodepool update states on things like servers or floating ips, is that they list the objects.  so we do one api call to update everything at once.17:05
corvusthat's what i mean when i say "disappeared from the list" i mean the results of "list_floating_ips"17:06
corvusso the condition for us to proceed is either 1) the status of the fip in question is reported as "DOWN" in list_floating_ips; or 2)  the fip in question no longer appears in list_floating_ips17:07
clarkband ya it seems likely that only one of those two conditions leads to leaking the fip later (possibly the down state rather than full removal?)17:08
clarkbsince it would be weird for something to be completely delisted (deleted) then show up again later17:08
fricklerdo we detach the fip from the instance before we delete either?17:09
clarkbno I don't think so17:10
clarkbthe first step in that state machine loop is to send the delete request17:10
clarkbhttps://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openstack/adapter.py#L173 there17:10
fricklerok, that might be an important detail in trying to reproduce this. I also don't exactly understand why we consider fip_status=down as deleted17:15
fungigood point, downed fips probably still count against the fip count/quota17:16
clarkbI suspect because when you delete an fip one of the steps before final removal is marking it down? So this allows us to proceed earlier (but maybe that leads to the leak?)17:16
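The proceed condition corvus spelled out above (FIP reported "DOWN" in the list, or absent from the list entirely) can be sketched roughly as follows. This is a hedged illustration, not nodepool's actual code: `delete_fip` and `list_fips` are stand-in callables for the real openstacksdk calls, and the real state machine in adapter.py tracks more state than this.

```python
import time

def wait_for_fip_gone(fip_id, delete_fip, list_fips, timeout=60, interval=1.0):
    """Request deletion, then poll the FIP list until the FIP is either
    reported DOWN or disappears from the list entirely."""
    delete_fip(fip_id)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        fips = {f["id"]: f for f in list_fips()}
        fip = fips.get(fip_id)
        # Either condition is treated as "deleted enough" to proceed with
        # deleting the node -- possibly the source of the leak if the
        # cloud later resurrects a DOWN fip.
        if fip is None or fip.get("status") == "DOWN":
            return True
        time.sleep(interval)
    return False
```

If the leak correlates with one branch (say, the DOWN case) and not the other, logging which condition fired would be the cheap way to find out.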
clarkbI've started booting mirror-update03 while waiting on the gitea upgrade to gate17:18
clarkbonce dns is updated and the system-config change is ready for that we can start to quiesce mirror-update02 to avoid conflicts and land the 03 system-config change17:19
clarkband then late this week I'd like to do the gerrit restart stuff I've been punting on due to other distractions17:20
clarkb(I'm saying that out loud here now as that should help me actually do those tasks)17:21
opendevreviewClark Boylan proposed opendev/zone-opendev.org master: Add mirror-update03 to DNS  https://review.opendev.org/c/opendev/zone-opendev.org/+/95313917:28
opendevreviewClark Boylan proposed opendev/system-config master: Add mirror-update03 to the inventory  https://review.opendev.org/c/opendev/system-config/+/95116317:35
clarkbI realized that we currently have separate host vars files for mirror-update01 and mirror-update02. I don't think there is any reason we need to do that so ^ updates docs to use a group vars file instead and I'll update private vars before 951163 lands17:36
fungiyeah, i think when 02 was created the opportunity to switch to group vars was missed17:38
opendevreviewClark Boylan proposed opendev/zone-opendev.org master: Point mirror-update to mirror-update03  https://review.opendev.org/c/opendev/zone-opendev.org/+/95314017:38
opendevreviewClark Boylan proposed opendev/zone-opendev.org master: Remove mirror-update02 from DNS  https://review.opendev.org/c/opendev/zone-opendev.org/+/95314117:38
clarkbok I think all of that prep is done/ready. The only thing remaining is grabbing locks and/or shutting down mirror-update02 and code reviews. I'll wait for gitea before I start looking at that though as gitea should be soon17:41
opendevreviewMerged opendev/zone-opendev.org master: Add mirror-update03 to DNS  https://review.opendev.org/c/opendev/zone-opendev.org/+/95313917:42
fungizuul thinks the gitea upgrade is 4 minutes from merging, so should put the deploy before the hourly jobs17:45
fungiit's finally wrapping up, should still land in time (if only barely)17:56
opendevreviewMerged opendev/system-config master: Update to Gitea 1.24.2  https://review.opendev.org/c/opendev/system-config/+/94856017:57
clarkbhere we go17:58
clarkbhttps://gitea09.opendev.org:3081/opendev/system-config/ is running 1.24.2 now18:02
clarkbthe restart seems a bit slow I suspect there are db migrations (and by that I mean like 40 seconds instead of the usual 5 ish)18:02
clarkbnot a big deal18:02
clarkbgit clone from that url works for me too18:03
fungimmm, yeah seems fine to me after18:05
fungifrom a ui perspective, seems it added a left sidebar for the file tree when looking at file contents18:08
fungiwhich wasn't there in 1.23.818:09
clarkbyup and there are fancier icons for files and directories it recognizes too18:09
clarkb09-12 are done now.18:09
fungisadly zuul's custom sphinx directives still cause challenges for the rst rendering18:10
fungihttps://gitea09.opendev.org:3081/zuul/zuul-jobs/src/branch/master/roles/ensure-nox/README.rst18:10
fungithough i think it's just that it doesn't know that zuul:rolevar content bodies are printable18:12
clarkbya I think the renderer hits an error and just stops proceeding with the render?18:12
funginot exactly, because in that example it skipped the first zuul:rolevar content and printed the heading that came after it too18:14
fungiit's done deploying on gitea14 now18:14
clarkb13 and 14 are done now. The job should complete soon18:14
clarkbah18:14
clarkbhttps://zuul.opendev.org/t/openstack/buildset/b286c574dd2d4aba8f254830f30a3ffc success18:15
* fungi cheers18:15
clarkbthe last major thing to check is probably gerrit replication18:15
fungiwe shoulda saved the mirror-update03 dns addition for this18:16
clarkbI think https://gitea09.opendev.org:3081/starlingx/nfv/commit/43479af7b5e851f8e18d26cbaf865e7e1d0ff758 was pushed to gerrit after gitea09 updated18:17
clarkbhttps://review.opendev.org/c/starlingx/nfv/+/953128 its this change18:17
clarkbps6 was uploaded at 18:09 according to gerrit timestamps and 18:02 ish is when I noted that gitea09 was upgraded so I think we're probably good with replication too18:17
fungiyeah, that checks out18:18
clarkbany concerns with me starting to grab mirror-update02 locks in a screen on that server for eventual landing of the mirror-update03 change later today?18:19
clarkbI'm thinking it may take a bit to grab all the locks and it is almost lunch time here and I'm not sure if I have reviews yet so not sure if now is a good time to start18:19
clarkbI don't think there are any important mirror updates that we are waiting on either though which is probably the biggest thing18:19
clarkblooking at crontab -l there are a couple of cron jobs that don't flock as well that we need to turn off (this will require putting 02 in the emergency file)18:21
clarkbbut I think I start with all the flocks. Then when I've got them all I can worry about that last piece of collision avoidance18:22
clarkbhttps://gitea09.opendev.org:3081/starlingx/nfv/commit/ffa556abd7475cc2eab58ae3e15de94a296f4274 this one was after all 6 giteas updated (same change new patchset)18:22
clarkbhttps://opendev.org/starlingx/nfv/commit/ffa556abd7475cc2eab58ae3e15de94a296f4274 a better link18:23
fungii wonder if it would make more sense to shut down the server?18:26
clarkbfungi: yes I think so. But to do that safely I think we have to lock everything first18:27
clarkbbut then maybe I don't have to edit the crontab18:27
clarkbI have 13 / 16 flocks held in a root screen on 0218:27
clarkbthe other three are debian, ubuntu, and ubuntu-ports and they are running18:27
fungior just shut down the cron service and wait for everything to finish18:27
clarkboh ya that should work too18:28
fungiholding locks makes sense when it's one or a handful of things we're trying to block, but when it's everything there are probably less labor-intensive solutions18:28
clarkbya though I wonder if cronjobs that fire and have no crond to report back to end up zombifying? I can grab the flocks to prevent new jobs then when that is done we can stop whatever the cron service is on ubuntu18:29
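The per-mirror exclusion being grabbed here can be illustrated from Python via fcntl; the mirror scripts themselves use flock(1) from cron, so this is just a sketch of the same mechanism with a placeholder lock path:

```python
import fcntl
import os

def try_hold_lock(path):
    """Take an exclusive, non-blocking flock on the given lock file.
    Returns the open fd on success (the caller keeps it open to hold
    the lock), or None if a mirror job currently holds it."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd
    except BlockingIOError:
        # another process (e.g. a running mirror update) holds the lock
        os.close(fd)
        return None
```

Because flock locks belong to the open file description, holding the fd open in a screen session blocks any cron-launched script that flocks the same file, which is exactly the manual exclusion described above.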
fungishould probably also double-check first that everything is actually healthy, rather than trying to debug a problem after the server replacement that turns out to have been present since before we started18:29
clarkbfungi: in afs you mean? That is a good idea18:29
fungiyeah, i'm looking now18:29
fungiaccording to grafana most of the mirrors finished around 15-16 hours ago18:31
fungii kinda thought we were updating them more frequently than that18:31
fungioh, never mind18:32
fungii was mixing up the wheel updates with the distro package mirrors18:32
clarkbah18:32
clarkbdebian just finished so I grabbed its lock18:32
fungieverything that's configured to update has done so within the past 6 hours, yeah18:32
fungiso looks like we should be fine18:33
fungia few of the mirrors are skating close to quota, but nothing looks completely full18:34
clarkbthe three non flock'd jobs are afsmon (reporting the stats grafana renders), docs/tarballs/etc release, and the publication of mirror-update logs to afs18:40
clarkbfor that last item we're rsyncing using -rltvz from the server /var/log/$location to related afs dirs18:41
clarkbI don't think we need to spend a bunch of energy trying to coalesce old and new logs there. I think it's fine to just stop 02 from writing and let 03 start writing?18:42
fungiyeah, i'm okay with that18:42
fungiwe'll lose something like a few days of log history if we do nothing18:42
fungiworst case we can keep the old server for a few days in case anyone needs "old" logs18:43
clarkbya18:43
clarkbI'm going to find lunch while I'm waiting for ubuntu and ubuntu-ports to finish18:51
clarkbok I have all 16 flocks held in the root screen19:23
clarkbstill finishing up lunch though. Not sure if you want to proceed with system shutdown / crond stoppage19:23
clarkbok I'm ready now I think20:22
clarkbmirror-update02 is in the emergency file now20:24
clarkbfungi: do you think `systemctl stop cron.service && systemctl disable cron.service` is sufficient on mirror-update02 for us to proceed with adding 03 to the inventory?20:25
fungiyes20:26
fungiwe can also check for the crond process in ps20:26
fungibut that should suffice to keep it from restarting if the server gets accidentally rebooted20:27
clarkbdone20:29
clarkbthats in window 0 of my root screen on 02. Do you want to approve the 03 inventory addition change when you're happy with 02's state?20:30
fungiattached20:31
fungilooks clear to me20:31
clarkbcool do you want to +A or should I?20:32
clarkbhttps://review.opendev.org/c/opendev/system-config/+/951163 this change20:32
fungii can20:32
clarkbthanks!20:36
clarkbI've done a quick first pass on the meeting agenda. Cleaned up some old content and updated content I have enough context for (dib centos 10 and replacing old servers in particular)20:47
opendevreviewMerged opendev/system-config master: Add mirror-update03 to the inventory  https://review.opendev.org/c/opendev/system-config/+/95116321:08
clarkbthe hourly jobs are just finishing up then ^ should deploy21:08
clarkbthe mirror-update job is running now. I expect it to be slower than usual as it dkms builds openafs21:18
clarkbinfra-prod-service-afs failed and I think it is because the setup of the vos release ssh key between mirror-update and afs servers means mirror-update needs to run before the afs playbook21:21
clarkbI'm going to dig into that a bit more now and if that is the fix we should be able to push a change that sets up that relationship that redeploys things and makes them happy21:22
fungiyeah, there's no rush on this, if anyone notices docs pages not updating or whatever we can explain21:27
clarkbok the issue is that mirror-update02 is in the emergency file so its ssh keypair wasn't registered and we had an assertion fail21:28
clarkbI guess we can remove it from the emergency file and let things run, remove it from the inventory and let things run, or temporarily remove that assertion?21:29
clarkbI think that it should be safe to remove it from the emergency file unless we ensure that cron is an enabled running service somewhere21:29
clarkbalternatively maybe we just remove it from the inventory but don't delete the server yet21:30
clarkbactually that may be the best option. We can then add it back to the inventory if we have to rollback. But removing it from the inventory will cause the jobs to rerun too and apply that ssh key which is nice21:30
fungiseems safe to remove it completely, yes21:31
opendevreviewClark Boylan proposed opendev/system-config master: Remove mirror-update02 from our ansible inventory  https://review.opendev.org/c/opendev/system-config/+/95315321:33
clarkbnote mirror-update03 is going to start running cronjobs nowish but they will all fail due to ssh not working21:34
clarkbthat should be ok21:34
clarkbthey will just do work then vos release will have an ssh error until we get 953153 (or something like it) in, or manually set up authorized keys on the afs server(s)21:34
fungithis might also be the explanation for why they weren't using group vars originally21:34
clarkbfungi: maybe, though the ansible configuring the keys is using the group which is why we're hitting this21:35
clarkbits basically doing a loop over every mirror-update group member and attempting to ensure the ssh key on those servers is in authorized keys on afs servers21:35
clarkbbut since mirror-update02 is in the emergency file we never loaded the key material to pass to the other servers21:35
clarkbin any case I think 953153 will address the issue since that removes mirror-update02 from the group by not having it in our inventory then we won't try to loop over it anymore21:36
clarkbya if you look at /var/log/afs-release/afs-release.log on mirror-update03 it reports failures to release due to ssh21:42
clarkbthis is expected given my understanding of the situation. I'm just calling it out that things are failing as expected and it doesn't seem to be a problem. We should be able to confirm via that log file that things are working once the fix change deploys21:43
clarkbit runs every 5 minutes so is a good canary21:43
clarkbdo we want to enqueue 953153 directly to the gate to speed it up? Or just let it roll as is21:43
fungii'm about to start cooking dinner, but good with either21:44
clarkbI'll enqueue it since I'm impatient21:45
fungisgtm, thanks!21:45
opendevreviewMerged opendev/system-config master: Remove mirror-update02 from our ansible inventory  https://review.opendev.org/c/opendev/system-config/+/95315321:54
clarkbthe afs deployment job is running for ^ now which should fix things for us22:05
clarkbthe job finished successfully. At 22:10 the cronjob to release docs/tarballs/etc will run and should confirm things are happy now22:07
clarkbhrm it still failed. I suspect the problem now is known_hosts22:10
clarkbyes I think that was it based on a manual run of ssh. I went ahead and manually added the known_host entry and we'll see if the 22:15 run is happy22:12
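The manual known_hosts fix amounts to making sure the AFS server's host key line is present before the unattended vos-release cron job sshes over. A minimal sketch (the actual key line was added by hand and isn't reproduced here; the values below are placeholders):

```python
def ensure_known_host(known_hosts_path, host_key_line):
    """Append a host key line to a known_hosts file if it's not already
    present. Returns True if the line was added, False if it was
    already there. Idempotent, so safe to re-run from config
    management."""
    try:
        with open(known_hosts_path) as f:
            if host_key_line.strip() in (line.strip() for line in f):
                return False
    except FileNotFoundError:
        pass  # no known_hosts yet; we'll create it below
    with open(known_hosts_path, "a") as f:
        f.write(host_key_line.rstrip("\n") + "\n")
    return True
```

Pre-seeding the entry this way avoids the interactive host key prompt that makes non-interactive ssh (like the every-5-minutes release cron) fail.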
clarkbit is running vos release docs right now22:15
clarkb2025-06-23 22:16:19,823 release DEBUG    Released volume docs successfully22:17
clarkbI think we are good now22:17
clarkbubuntu and ubuntu-ports are running updates too which should give us good indication from the larger volumes as well22:17
clarkband cloud archive just started22:18
clarkbReleased volume mirror.ubuntu-cloud successfully22:20
clarkbonce others are happy with the new server (I haven't checked any rsync mirroring results but reprepro and vos release look good on initial inspection) the next step is landing https://review.opendev.org/c/opendev/zone-opendev.org/+/953140 then deciding if/how/when to delete mirror-update02. I think with cron.service disabled we're ok to leave it running for a bit22:21
clarkbhttps://grafana.opendev.org/d/9871b26303/afs?orgId=1&from=now-6h&to=now&timezone=utc I think the data updates on here show that afsmon is working too? Also you can see after I disabled and stopped cron.service on 02 that we stopped getting vos release timer info until the new server booted22:23
fungiawesome. thanks for working it through!22:24
clarkball of the rsync mirroring scripts are on */6 hourly schedules 22:27
clarkbwhich means they will start during the midnight UTC hour (they have varied minute rules so are spread out through that hour)22:27
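The "every 6 hours with varied minutes" pattern described above would look something like this in a crontab; the mirror names, minute offsets, and lock paths are illustrative placeholders, not the real entries:

```crontab
# Illustrative only: each mirror gets its own minute offset so the
# every-6-hours runs are spread through the hour, and each run takes
# a per-mirror flock so overlapping runs are skipped.
05 */6 * * * flock -n /var/run/centos-mirror.lock /usr/local/bin/centos-mirror
17 */6 * * * flock -n /var/run/debian-mirror.lock /usr/local/bin/debian-mirror
41 */6 * * * flock -n /var/run/ubuntu-mirror.lock /usr/local/bin/ubuntu-mirror
```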
clarkbok last call on the meeting agenda. I think I've got all the edits in that I am aware of. Let me know if anything is missing and I can add it. Otherwise I'll get that sent out in 15-20 minutes22:45
clarkband sent23:06
clarkbcentos stream mirror should be first starting at 0006 23:34
clarkbI'll check that is happy before I sign off for the day23:34

Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!