opendevreview | tianyutong proposed openstack/project-config master: Allow tag creation for heterogeneous-distributed-training-framework https://review.opendev.org/c/openstack/project-config/+/953069 | 01:41 |
mnasiadka | morning | 06:46 |
mnasiadka | I see the devstack based testing for DIB is moving forward, nice - tonyb if you need any help - just shout :) | 06:47 |
mnasiadka | clarkb: I'm afraid there are no updates for the boot partition GUID in stream 10 - should I just work around it in the centos element, or leave a reno that it's broken for now? | 06:49 |
*** liuxie is now known as liushy | 07:24 | |
*** dhill is now known as Guest18614 | 11:07 | |
opendevreview | Merged openstack/project-config master: Allow tag creation for heterogeneous-distributed-training-framework https://review.opendev.org/c/openstack/project-config/+/953069 | 12:31 |
clarkb | mnasiadka: we can probably work around it if there is no movement. I think the approach I would take is if we find no explicit / but there is a different bootable partition then we assume that is / | 15:01 |
clarkb | but I'd have to look at the dib code again to be sure that is the right approach. there are several partition types it is trying to track and maybe some other approach is more accurate | 15:05 |
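A rough illustration of the fallback being discussed, not DIB's actual code: look for a partition carrying the discoverable-partitions root type GUID, and if none is found fall back to whichever partition is flagged bootable and treat it as /. The loop device path and the x86-64 root GUID are assumptions for the example.

```bash
# Hedged sketch of the fallback heuristic, not DIB's implementation.
dev=/dev/loop0   # hypothetical loop device for the image under inspection
# x86-64 root partition type GUID from the discoverable partitions spec (assumed arch)
root_guid=4f68bce3-e8cd-4db1-96e7-fbcaf984b709
root=$(lsblk -nro NAME,PARTTYPE "$dev" | awk -v g="$root_guid" '$2==g {print $1; exit}')
if [ -z "$root" ]; then
    # no explicit / partition found; assume the (first) flagged/bootable partition is /
    # (simplified: any partition with flags set, e.g. boot or legacy_boot)
    root=$(lsblk -nro NAME,PARTFLAGS "$dev" | awk '$2 != "" {print $1; exit}')
fi
echo "treating /dev/${root:-unknown} as /"
```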
clarkb | the ord backup server needs pruning | 15:06 |
clarkb | I'm going to do weekly local updates then after that the main things on my todo list for today are the gitea 1.24 upgrade if we're comfortable with that and booting some new noble servers for mirror update and zookeeper. corvus I had a thought this morning that I could push a zuul executor on noble test change and if that passes boot a new noble ze canary node too. If I do that what | 15:07 |
clarkb | digits would you like me to give that server (ze01 and replace existing servers or ze13 and start appending to the list?) | 15:07 |
fungi | worth noting the replacements booted last week just reused the same names as the originals. given that, cycling back to ze01 should also work fine | 15:13 |
opendevreview | Clark Boylan proposed opendev/system-config master: Test zuul executor deployment on noble https://review.opendev.org/c/opendev/system-config/+/953127 | 15:19 |
clarkb | fungi: re gitea 1.24 did you get a chance to check the held node? Looks like you +2'd the change. Just wondering if we should proceed now or if there is more testing to be done. I haven't checked the 1.24.2 held node yet but will do so momentarily | 15:20 |
clarkb | quickly poking around, the web ui looks like 1.24.1's did. I don't see anything immediately concerning. I also ran a clone that worked too and the resulting git repo locally has the content I expect so ya this looks fine to me if we think we're ready to proceed | 15:23 |
clarkb | corvus: alternatively, maybe you'd prefer we replace schedulers first? | 15:25 |
fungi | clarkb: yeah, held node looked fine to me | 15:31 |
Clark[m] | I'm grabbing something for breakfast really quickly but afterwards I guess we approve the gitea upgrade if there is no additional feedback? | 15:32 |
fungi | yeah, i'm on hand whenever, just in conference calls for another half an hour | 15:32 |
mnasiadka | clarkb: I’ll update the patch tomorrow and rebase it on tonyb’s chain | 15:52 |
clarkb | dan_with: two non urgent observations from our side on the raxflex cloud. First is that our nodepool service is leaking floating ips. This is a long standing openstack issue and not something specific to your deployment. It is expected and we even have cleanup routines to clear them out. Unfortunately, those are currently disabled because we have two competing tools running at the | 16:00 |
clarkb | moment that don't coordinate and they would delete each other's floating ips if we enabled the functionality. But it has me wondering if there is any interest in trying to find the reason for why the openstack apis say the fip is deleted when we attempt to delete them the first time and yet they stick around. The other thing is I'm manually deleting leaked floating ips that I've | 16:00 |
clarkb | checked won't break anything and the api response times from sjc3 are pretty slow. dfw3 was really quick in comparison. Not sure if that is a known issue on your end but that may be worth checking too? | 16:00 |
clarkb | jamesdenton__: ^ fyi | 16:00 |
jamesdenton__ | reading | 16:00 |
jamesdenton__ | thanks for the feedback, will check that out | 16:02 |
clarkb | jamesdenton__: our node deletion process is to delete the fip and wait for openstack to say it is deleted (I can't remember if this is a nova or neutron api call but we could look it up). Then afterwards we delete the node. In the case of these leaks we believe that fip delete is "succeeding" such that we proceed with deleting the node but the fip hasn't actually gone away. This is | 16:04 |
clarkb | behavior we have observed in openstack for a long long time so nothing new on your end. More just wondering if there is interest in trying to debug further since yall are using fips in your cloud setup | 16:04 |
jamesdenton__ | yes, most definitely | 16:04 |
jamesdenton__ | also curious what your delete strategy is so we might try and replicate | 16:05 |
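Not the nodepool code itself, but a hedged CLI approximation of the sequence clarkb describes, which may help with replication outside of nodepool; the IDs are passed in as arguments and all command names are standard OpenStack client calls.

```bash
#!/bin/bash
# Hedged reproduction sketch of the described flow: delete the FIP, wait until
# the API reports it DOWN or it disappears, then delete the server.
FIP_ID=${1:?usage: $0 <fip-id> <server-id>}
SERVER_ID=${2:?usage: $0 <fip-id> <server-id>}

openstack floating ip delete "$FIP_ID"
while :; do
    status=$(openstack floating ip show "$FIP_ID" -f value -c status 2>/dev/null)
    [ -z "$status" ] || [ "$status" = "DOWN" ] && break   # gone or DOWN: proceed
    sleep 1
done
openstack server delete "$SERVER_ID"

# A leaked FIP is one that later lingers (or reappears) in:
openstack floating ip list
```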
corvus | clarkb: i pushed https://review.opendev.org/952697 and https://review.opendev.org/952698 a bit ago... i'm happy to continue the noble upgrade for zuul | 16:06 |
clarkb | corvus: oh perfect. I can abandon my change. And thanks! | 16:07 |
clarkb | jamesdenton__: https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openstack/adapter.py#L166-L196 this is the state machine code that we use | 16:07 |
corvus | np! i think i might do schedulers this week and then executors next weekend; try to do them all at once so the process doesn't drag out | 16:08 |
clarkb | looking at the code there I suppose it is possible that adapter._hasFloatingIps() is returning false and we're relying on nova to automatically clean up the fips rather than trying to do it ourselves? | 16:09 |
clarkb | corvus: ^ you probably have a better sense of how likely that is. We could add some simple logging maybe to double check | 16:10 |
clarkb | hrm looks like that is asking openstacksdk for any answer | 16:10 |
clarkb | ok I think I've got fip cleanup well underway. I'm going to approve the change to upgrade gitea. The testing for that should take about an hour if there are any objections or concerns we can unapprove/wip the change | 16:15 |
clarkb | infra-root (and others) I've been distracted the last couple of weeks, but do plan to put a meeting agenda together today and chair the meeting tomorrow. Let me know if there are updates that should be made to the agenda. It is very likely I have missed things | 16:18 |
fungi | clarkb: i think there are probably a few older agenda items that could come off | 16:22 |
clarkb | ack I've made a note to clean things up too | 16:24 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Add support for CentOS Stream 10 https://review.opendev.org/c/openstack/diskimage-builder/+/934045 | 16:49 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Add support for CentOS Stream 10 https://review.opendev.org/c/openstack/diskimage-builder/+/934045 | 16:49 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Add support for CentOS Stream 10 https://review.opendev.org/c/openstack/diskimage-builder/+/934045 | 16:50 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Add support for CentOS Stream 10 https://review.opendev.org/c/openstack/diskimage-builder/+/934045 | 16:51 |
corvus | clarkb: with rax flex, we call delete_floating_ip | 16:56 |
corvus | 2025-06-23 16:55:17,914 DEBUG nodepool.OpenStackAdapter.raxflex-dfw3: API call delete_floating_ip in 0.3182524172589183 | 16:56 |
corvus | so that means the full sequence you highlighted in https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openstack/adapter.py#L166-L196 should be run | 16:57 |
clarkb | corvus: cool thank you for confirming | 17:02 |
clarkb | so its almost like we delete the fip, it goes away so we proceed then sometime later the cloud remembers that the fip is still there? Zombie floating ips | 17:02 |
corvus | yeah; i don't think we can tell from the logs whether the fip was reported "DOWN" or whether it disappeared from the list | 17:02 |
corvus | we consider either of those sufficient to proceed | 17:03 |
corvus | i don't know whether we have seen both behaviors, or whether we only ever see "DOWN" and the other is just a safety valve | 17:03 |
corvus | but maybe we see both? and maybe that could be an important detail? i dunno. | 17:04 |
corvus | oh, sorry, to add context in case anyone isn't aware: the way that zuul/nodepool update states on things like servers or floating ips, is that they list the objects. so we do one api call to update everything at once. | 17:05 |
corvus | that's what i mean when i say "disappeared from the list" i mean the results of "list_floating_ips" | 17:06 |
corvus | so the condition for us to proceed is either 1) the status of the fip in question is reported as "DOWN" in list_floating_ips; or 2) the fip in question no longer appears in list_floating_ips | 17:07 |
clarkb | and ya it seems likely that only one of those two conditions leads to leaking the fip later (possibly the down state rather than full removal?) | 17:08 |
clarkb | since it would be weird for something to be completely delisted (deleted) then show up again later | 17:08 |
frickler | do we detach the fip from the instance before we delete either? | 17:09 |
clarkb | no I don't think so | 17:10 |
clarkb | the first step in that state machine loop is to send the delete request | 17:10 |
clarkb | https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/openstack/adapter.py#L173 there | 17:10 |
frickler | ok, that might be an important detail in trying to reproduce this. I also don't exactly understand why we consider fip_status=down as deleted | 17:15 |
fungi | good point, downed fips probably still count against the fip count/quota | 17:16 |
clarkb | I suspect because when you delete an fip one of the steps before final removal is marking it down? So this allows us to proceed earlier (but maybe that leads to the leak?) | 17:16 |
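For the manual cleanup mentioned above, one hedged way to surface candidates is to list FIPs reported DOWN and look for ones with no attached port; whether that exactly matches the leaked set is an assumption.

```bash
# Hedged sketch: floating IPs reported DOWN with an empty Port column are not
# attached to anything and are candidates for leaked FIPs.
openstack floating ip list --status DOWN -f value -c ID -c "Floating IP Address" -c Port
# After double checking a candidate won't break anything, remove it with:
# openstack floating ip delete <fip-id>
```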
clarkb | I've started booting mirror-update03 while waiting on the gitea upgrade to gate | 17:18 |
clarkb | once dns is updated and the system-config change is ready for that we can start to quiesce mirror-update02 to avoid conflicts and land the 03 system-config change | 17:19 |
clarkb | and then late this week I'd like to do the gerrit restart stuff I've been punting on due to other distractions | 17:20 |
clarkb | (I'm saying that out loud here now as that should help me actually do those tasks) | 17:21 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Add mirror-update03 to DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/953139 | 17:28 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add mirror-update03 to the inventory https://review.opendev.org/c/opendev/system-config/+/951163 | 17:35 |
clarkb | I realized that we currently have separate host vars files for mirror-update01 and mirror-update02. I don't think there is any reason we need to do that so ^ updates docs to use a group vars file instead and I'll update private vars before 951163 lands | 17:36 |
fungi | yeah, i think when 02 was created the opportunity to switch to group vars was missed | 17:38 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Point mirror-update to mirror-update03 https://review.opendev.org/c/opendev/zone-opendev.org/+/953140 | 17:38 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Remove mirror-update02 from DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/953141 | 17:38 |
clarkb | ok I think all of that prep is done/ready. The only thing remaining is grabbing locks and/or shutting down mirror-update02 and code reviews. I'll wait for gitea before I start looking at that though as gitea should be soon | 17:41 |
opendevreview | Merged opendev/zone-opendev.org master: Add mirror-update03 to DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/953139 | 17:42 |
fungi | zuul thinks the gitea upgrade is 4 minutes from merging, so should put the deploy before the hourly jobs | 17:45 |
fungi | it's finally wrapping up, should still land in time (if only barely) | 17:56 |
opendevreview | Merged opendev/system-config master: Update to Gitea 1.24.2 https://review.opendev.org/c/opendev/system-config/+/948560 | 17:57 |
clarkb | here we go | 17:58 |
clarkb | https://gitea09.opendev.org:3081/opendev/system-config/ is running 1.24.2 now | 18:02 |
clarkb | the restart seems a bit slow I suspect there are db migrations (and by that I mean like 40 seconds instead of the usual 5 ish) | 18:02 |
clarkb | not a big deal | 18:02 |
clarkb | git clone from that url works for me too | 18:03 |
fungi | mmm, yeah seems fine to me after | 18:05 |
fungi | from a ui perspective, seems it added a left sidebar for the file tree when looking at file contents | 18:08 |
fungi | which wasn't there in 1.23.8 | 18:09 |
clarkb | yup and there are fancier icons for files and directories it recognizes too | 18:09 |
clarkb | 09-12 are done now. | 18:09 |
fungi | sadly zuul's custom sphinx directives still cause challenges for the rst rendering | 18:10 |
fungi | https://gitea09.opendev.org:3081/zuul/zuul-jobs/src/branch/master/roles/ensure-nox/README.rst | 18:10 |
fungi | though i think it's just that it doesn't know that zuul:rolevar content bodies are printable | 18:12 |
clarkb | ya I think the renderer hits an error and just stops proceeding with the render? | 18:12 |
fungi | not exactly, because in that example it skipped the first zuul:rolevar content and printed the heading that came after it too | 18:14 |
fungi | it's done deploying on gitea14 now | 18:14 |
clarkb | 13 and 14 are done now. The job should complete soon | 18:14 |
clarkb | ah | 18:14 |
clarkb | https://zuul.opendev.org/t/openstack/buildset/b286c574dd2d4aba8f254830f30a3ffc success | 18:15 |
* fungi cheers | 18:15 | |
clarkb | the last major thing to check is probably gerrit replication | 18:15 |
fungi | we shoulda saved the mirror-update03 dns addition for this | 18:16 |
clarkb | I think https://gitea09.opendev.org:3081/starlingx/nfv/commit/43479af7b5e851f8e18d26cbaf865e7e1d0ff758 was pushed to gerrit after gitea09 updated | 18:17 |
clarkb | https://review.opendev.org/c/starlingx/nfv/+/953128 its this change | 18:17 |
clarkb | ps6 was uploaded at 18:09 according to gerrit timestamps and 18:02 ish is when I noted that gitea09 was upgraded so I think we're probably good with replication too | 18:17 |
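Beyond comparing timestamps, a hedged spot-check for replication is to compare branch tips between gerrit and the upgraded backend directly; matching SHAs mean replication has kept up (the repository here is just the one from this example).

```bash
# Compare the master tip as gerrit serves it with what gitea09 is serving.
git ls-remote https://review.opendev.org/starlingx/nfv refs/heads/master
git ls-remote https://gitea09.opendev.org:3081/starlingx/nfv refs/heads/master
```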
fungi | yeah, that checks out | 18:18 |
clarkb | any concerns with me starting to grab mirror-update02 locks in a screen on that server for eventual landing of the mirror-update03 change later today? | 18:19 |
clarkb | I'm thinking it may take a bit to grab all the locks and it is almost lunch time here and I'm not sure if I have reviews yet so not sure if now is a good time to start | 18:19 |
clarkb | I don't think there are any important mirror updates that we are waiting on either though which is probably the biggest thing | 18:19 |
clarkb | looking at crontab -l there are a couple of cron jobs that don't flock as well that we need to turn off (this will require putting 02 in the emergency file) | 18:21 |
clarkb | but I think I start with all the flocks. Then when I've got them all I can worry about that last piece of collision avoidance | 18:22 |
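The lock grabbing works because the mirror scripts wrap their work in flock on per-mirror lock files, so holding the same lock from a screen window blocks new runs. A minimal sketch of that pattern follows; the lock path and script name are placeholders, not the real paths on the server.

```bash
# Hedged sketch: hold a mirror lock from a screen window until released.
LOCK=/var/run/mirror-update/debian.lock   # placeholder path
flock "$LOCK" bash -c 'echo "lock held"; sleep infinity'

# The mirror job side uses the same lock and simply skips if it is busy, e.g.:
# flock -n "$LOCK" /usr/local/bin/some-mirror-script || echo "lock busy, skipping"
```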
clarkb | https://gitea09.opendev.org:3081/starlingx/nfv/commit/ffa556abd7475cc2eab58ae3e15de94a296f4274 this one was after all 6 giteas updated (same change new patchset) | 18:22 |
clarkb | https://opendev.org/starlingx/nfv/commit/ffa556abd7475cc2eab58ae3e15de94a296f4274 a better link | 18:23 |
fungi | i wonder if it would make more sense to shut down the server? | 18:26 |
clarkb | fungi: yes I think so. But to do that safely I think we have to lock everything first | 18:27 |
clarkb | but then maybe I don't have to edit the crontab | 18:27 |
clarkb | I have 13 / 16 flocks held in a root screen on 02 | 18:27 |
clarkb | the other three are debian, ubuntu, and ubuntu-ports and they are running | 18:27 |
fungi | or just shut down the cron service and wait for everything to finish | 18:27 |
clarkb | oh ya that should work too | 18:28 |
fungi | holding locks makes sense when it's one or a handful of things we're trying to block, but when it's everything there are probably less labor-intensive solutions | 18:28 |
clarkb | ya though I wonder if cronjobs that fire and have no crond to report back to end up zombifying? I can grab the flocks to prevent new jobs then when that is done we can stop whatever the cron service is on ubuntu | 18:29 |
fungi | should probably also double-check first that everything is actually healthy, rather than trying to debug a problem after the server replacement that turns out to have been present since before we started | 18:29 |
clarkb | fungi: in afs you mean? That is a good idea | 18:29 |
fungi | yeah, i'm looking now | 18:29 |
fungi | according to grafana most of the mirrors finished around 15-16 hours ago | 18:31 |
fungi | i kinda thought we were updating them more frequently than that | 18:31 |
fungi | oh, never mind | 18:32 |
fungi | i was mixing up the wheel updates with the distro package mirrors | 18:32 |
clarkb | ah | 18:32 |
clarkb | debian just finished so I grabbed its lock | 18:32 |
fungi | everything that's configured to update has done so within the past 6 hours, yeah | 18:32 |
fungi | so looks like we should be fine | 18:33 |
fungi | a few of the mirrors are skating close to quota, but nothing looks completely full | 18:34 |
clarkb | the three non flock'd jobs are afsmon (reporting the stats grafana renders), docs/tarballs/etc release, and the publication of mirror-update logs to afs | 18:40 |
clarkb | for that last item we're rsyncing using -rltvz from the server /var/log/$location to related afs dirs | 18:41 |
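A hedged sketch of that log publication step, with a placeholder destination since the exact AFS path isn't in the log:

```bash
# Publish the server's logs into AFS with the flags quoted above:
# -r recursive, -l copy symlinks as symlinks, -t preserve mtimes,
# -v verbose, -z compress in transit. "reprepro" stands in for $location.
rsync -rltvz /var/log/reprepro/ /afs/openstack.org/path/to/mirror-logs/reprepro/
```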
clarkb | I don't think we need to spend a bunch of energy trying to coalesce old and new logs there. I think it's fine to just stop 02 from writing and let 03 start writing? | 18:42 |
fungi | yeah, i'm okay with that | 18:42 |
fungi | we'll lose something like a few days of log history if we do nothing | 18:42 |
fungi | worst case we can keep the old server for a few days in case anyone needs "old" logs | 18:43 |
clarkb | ya | 18:43 |
clarkb | I'm going to find lunch while I'm waiting for ubuntu and ubuntu-ports to finish | 18:51 |
clarkb | ok I have all 16 flocks held in the root screen | 19:23 |
clarkb | still finishing up lunch though. Not sure if you want to proceed with system shutdown / crond stoppage | 19:23 |
clarkb | ok I'm ready now I think | 20:22 |
clarkb | mirror-update02 is in the emergency file now | 20:24 |
clarkb | fungi: do you think `systemctl stop cron.service && systemctl disable cron.service` is sufficient on mirror-update02 for us to proceed with adding 03 to the inventory? | 20:25 |
fungi | yes | 20:26 |
fungi | we can also check for the crond process in ps | 20:26 |
fungi | but that should suffice to keep it from restarting if the server gets accidentally rebooted | 20:27 |
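Alongside the stop/disable command quoted above, a couple of follow-up checks along the lines fungi suggests (standard systemctl/pgrep usage, not taken from the log):

```bash
systemctl stop cron.service && systemctl disable cron.service
systemctl is-enabled cron.service   # expect "disabled"
systemctl is-active cron.service    # expect "inactive"
pgrep -af cron                      # confirm no crond or straggler jobs remain
```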
clarkb | done | 20:29 |
clarkb | thats in window 0 of my root screen on 02. Do you want to approve the 03 inventory addition change when you're happy with 02's state? | 20:30 |
fungi | attached | 20:31 |
fungi | looks clear to me | 20:31 |
clarkb | cool do you want to +A or should I? | 20:32 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/951163 this change | 20:32 |
fungi | i can | 20:32 |
clarkb | thanks! | 20:36 |
clarkb | I've done a quick first pass on the meeting agenda. Cleaned up some old content and updated content I have enough context for (dib centos 10 and replacing old servers in particular) | 20:47 |
opendevreview | Merged opendev/system-config master: Add mirror-update03 to the inventory https://review.opendev.org/c/opendev/system-config/+/951163 | 21:08 |
clarkb | the hourly jobs are just finishing up then ^ should deploy | 21:08 |
clarkb | the mirror-update job is running now. I expect it to be slower than usual as it dkms builds openafs | 21:18 |
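If the deploy does sit on the openafs build, a hedged way to check progress on the new host (the dkms module name is assumed to be openafs):

```bash
dkms status | grep -i openafs   # shows built/installed state per kernel
modinfo openafs | head -n 3     # confirms the module is visible to the kernel
```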
clarkb | infra-prod-service-afs failed and I think it is because the setup of the vos release ssh key between mirror-update and afs servers means mirror-update needs to run before the afs playbook | 21:21 |
clarkb | I'm going to dig into that a bit more now and if that is the fix we should be able to push a change that sets up that relationship, redeploys things, and makes them happy | 21:22 |
fungi | yeah, there's no rush on this, if anyone notices docs pages not updating or whatever we can explain | 21:27 |
clarkb | ok the issue is that mirror-update02 is in the emergency file so its ssh keypair wasn't registered and we had an assertion fail | 21:28 |
clarkb | I guess we can remove it from the emergency file and let things run, remove it from the inventory and let things run, or temporarily remove that assertion? | 21:29 |
clarkb | I think that it should be safe to remove it from the emergency file unless we ensure that cron is an enabled running service somewhere | 21:29 |
clarkb | alternatively maybe we just remove it from the inventory but don't delete the server yet | 21:30 |
clarkb | actually that may be the best option. We can then add it back to the inventory if we have to rollback. But removing it from the inventory will cause the jobs to rerun too and apply that ssh key which is nice | 21:30 |
fungi | seems safe to remove it completely, yes | 21:31 |
opendevreview | Clark Boylan proposed opendev/system-config master: Remove mirror-update02 from our ansible inventory https://review.opendev.org/c/opendev/system-config/+/953153 | 21:33 |
clarkb | note mirror-update03 is going to start running cronjobs nowish but they will all fail due to ssh not working | 21:34 |
clarkb | that should be ok | 21:34 |
clarkb | they will just do work then vos release will have an ssh error until we get 953153 (or something like it) landed, or manually set up authorized keys on the afs server(s) | 21:34 |
fungi | this might also be the explanation for why they weren't using group vars originally | 21:34 |
clarkb | fungi: maybe, though the ansible configuring the keys is using the group which is why we're hitting this | 21:35 |
clarkb | its basically doing a loop over every mirror-update group member and attempting to ensure the ssh key on those servers is in authorized keys on afs servers | 21:35 |
clarkb | but since mirror-update02 is in the emergency file we never loaded the key material to pass to the other servers | 21:35 |
clarkb | in any case I think 953153 will address the issue since that removes mirror-update02 from the group by not having it in our inventory then we won't try to loop over it anymore | 21:36 |
clarkb | ya if you look at /var/log/afs-release/afs-release.log on mirror-update03 it reports failures to release due to ssh | 21:42 |
clarkb | this is expected given my understanding of the situation. I'm just calling it out that things are failing as expected and it doesn't seem to be a problem. We should be able to confirm via that log file that things are working once the fix change deploys | 21:43 |
clarkb | it runs every 5 minutes so is a good canary | 21:43 |
clarkb | do we want to enqueue 953153 directly to the gate to speed it up? Or just let it roll as is | 21:43 |
fungi | i'm about to start cooking dinner, but good with either | 21:44 |
clarkb | I'll enqueue it since I'm impatient | 21:45 |
fungi | sgtm, thanks! | 21:45 |
opendevreview | Merged opendev/system-config master: Remove mirror-update02 from our ansible inventory https://review.opendev.org/c/opendev/system-config/+/953153 | 21:54 |
clarkb | the afs deployment job is running for ^ now which should fix things for us | 22:05 |
clarkb | the job finished successfully. At 22:10 the cronjob to release docs/tarballs/etc will run and should confirm things are happy now | 22:07 |
clarkb | hrm it still failed. I suspect the problem now is known_hosts | 22:10 |
clarkb | yes I think that was it based on a manual run of ssh. I went ahead and manually added the known_host entry and we'll see if the 22:15 run is happy | 22:12 |
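One hedged way to do that manual known_hosts addition, with a placeholder hostname for the AFS server the release script ssh'es to:

```bash
# Record the host key, then verify ssh works non-interactively.
ssh-keyscan -H afs01.example.opendev.org >> /root/.ssh/known_hosts
ssh -o BatchMode=yes afs01.example.opendev.org true && echo "ssh ok"
```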
clarkb | it is running vos release docs right now | 22:15 |
clarkb | 2025-06-23 22:16:19,823 release DEBUG Released volume docs successfully | 22:17 |
clarkb | I think we are good now | 22:17 |
clarkb | ubuntu and ubuntu-ports are running updates too which should give us good indication from the larger volumes as well | 22:17 |
clarkb | and cloud archive just started | 22:18 |
clarkb | Released volume mirror.ubuntu-cloud successfully | 22:20 |
clarkb | once others are happy with the new server (I haven't checked any rsync mirroring results but reprepro and vos release look good on initial inspection) the next step is landing https://review.opendev.org/c/opendev/zone-opendev.org/+/953140 then deciding if/how/when to delete mirror-update02. I think with cron.service disabled we're ok to leave it running for a bit | 22:21 |
clarkb | https://grafana.opendev.org/d/9871b26303/afs?orgId=1&from=now-6h&to=now&timezone=utc I think the data updates on here show that afsmon is working too? Also you can see after I disabled and stopped cron.service on 02 that we stopped getting vos release timer info until the new server booted | 22:23 |
fungi | awesome. thanks for working it through! | 22:24 |
clarkb | all of the rsync mirroring scripts are on */6 hourly schedules | 22:27 |
clarkb | which means they will start during the midnight UTC hour (they have varied minute rules so are spread out through that hour) | 22:27 |
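Illustrative only (entries, paths, and minutes are invented): the schedule shape described is every six hours with staggered minutes, so the first full round lands during the midnight UTC hour.

```bash
# crontab -l style illustration of the */6 hourly pattern with varied minutes.
# m   h    dom mon dow  command
# 13  */6  *   *   *    flock -n /var/run/mirror-update/debian.lock /usr/local/bin/mirror-debian
# 27  */6  *   *   *    flock -n /var/run/mirror-update/ubuntu.lock /usr/local/bin/mirror-ubuntu
```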
clarkb | ok last call on the meeting agenda. I think I've got all the edits in that I am aware of. Let me know if anything is missing and I can add it. Otherwise I'll get that sent out in 15-20 minutes | 22:45 |
clarkb | and sent | 23:06 |
clarkb | centos stream mirror should be first starting at 0006 | 23:34 |
clarkb | I'll check that is happy before I sign off for the day | 23:34 |