Monday, 2025-03-17

03:03 <ajay> hi
06:52 *** jroll02 is now known as jroll0
12:33 <opendevreview> Stephen Reaves proposed openstack/diskimage-builder master: Enable custom overlays  https://review.opendev.org/c/openstack/diskimage-builder/+/943500
12:44 <opendevreview> Stephen Reaves proposed openstack/diskimage-builder master: Enable custom overlays  https://review.opendev.org/c/openstack/diskimage-builder/+/943500
14:22 <frickler> infra-root: we seem to have some broken gitea replication. this gives me a 404, other instances seem fine https://gitea10.opendev.org:3081/openstack/python-openstackclient/src/tag/7.2.0/ see also https://zuul.opendev.org/t/openstack/build/4b097f5985b44527a4635ecc33ba70c9
14:25 <fungi> interesting, python-openstackclient 7.2.0 was tagged 2024-10-18
14:25 <fungi> so it's been missing from gitea10 for 5 months?
14:27 <fungi> and yeah, i confirmed by cloning from gitea10, the tag is missing in the git repo there, it's not just a database problem
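The same check can be run against every backend without a full clone using git ls-remote; a sketch, assuming the backends are gitea09 through gitea14:

    # Empty output after a hostname means that backend is missing the ref.
    for n in 09 10 11 12 13 14; do
      echo "gitea${n}: $(git ls-remote https://gitea${n}.opendev.org:3081/openstack/python-openstackclient refs/tags/7.2.0)"
    done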
14:33 <fungi> i can spot-fix that by telling gerrit to re-replicate openstack/python-openstackclient to gitea10, but holding off in case anyone else wants to check on it first
14:34 <fungi> https://wiki.openstack.org/wiki/Infrastructure_Status doesn't mention any problems on the day that tag was pushed
14:35 <fungi> https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2024-10-18.log.html indicates there was a gitea upgrade that merged at 17:27 utc that day
14:36 <fungi> though the tag was created at 10:37:01 utc, so nearly a 7-hour gap before the gitea upgrade would have occurred
14:38 <fungi> we also did a gerrit restart later that day around 18:40 utc
14:39 <fungi> for an upgrade
14:48 <clarkb> it's possible the gitea instance had OOM'd
14:48 <clarkb> we were fighting that issue until very recently
14:48 <fungi> ooh, good point
14:49 <fungi> unfortunately dmesg only goes back as far as 2024-11-25
14:49 <fungi> and we only have a month's retention for syslog
14:49 <clarkb> if the other giteas have the tag then I'd say that is the most likely cause
14:50 <clarkb> if the problem affects all giteas then something else is likely at play
14:50 <fungi> yeah, i concur
14:50 <fungi> should i go ahead and force replication for repo/backend?
14:51 <clarkb> I think so
14:51 <clarkb> I would do the whole backend as other events may have been lost too
14:52 <fungi> all of gitea10 you mean?
14:52 <clarkb> yes
14:52 <frickler> I'd say yes, but this also makes me wonder how many similar issues we might be having that are still unnoticed
14:52 <clarkb> `ssh -p 29418 admin@review02 replication start --url gitea10` iirc
14:52 <fungi> frickler: you mean on the other backends?
14:53 <clarkb> we can replicate to all of the backends. I just did 09 when I fixed the port 3000 stuff
14:53 <frickler> maybe, yes. but I also wonder how this wasn't fixed when later tags were replicated for the same repo
14:53 <fungi> i think only new refs get replicated
14:54 <fungi> gerrit doesn't normally do a full mirror operation with every event
14:54 <clarkb> fungi: correct
14:54 <fungi> tag objects themselves aren't parents of future commits
14:54 <clarkb> the pushes are for specific refs. Only events updating those refs cause them to be pushed, unless you do a full replication request via the ssh api or similar
14:54 <frickler> so we might want to do a full mirror periodically just to be sure? like weekly maybe?
14:55 <clarkb> I don't know that that is necessary now that gitea is fairly stable. But if we want to be extra sure that would probably be fine too
14:56 <frickler> ok, so just doing it once for all hosts then might be good enough
14:56 <fungi> for a spot check that replication will fix this, i did a `replication start openstack/python-openstackclient --url=gitea10` and https://gitea10.opendev.org:3081/openstack/python-openstackclient/src/tag/7.2.0/ is returning content now
14:56 <fungi> i'll rerun without the repo and url patterns now to replicate everything to all backends
14:56 <clarkb> ack
14:56 <fungi> just a `replication start` in other words
14:57 <frickler> the other thing would be to inform the release team to hold releases not only for gerrit restarts but also for gitea actions?
14:57 <fungi> tailing the end of `gerrit show-queue` will give you a count of remaining tasks, most of which will be the replication
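Collected in one place, the gerrit replication-plugin commands used in this exchange (host and port as given above; the `-w` wide flag on show-queue is an assumption):

    # Spot fix: re-replicate one repo to one backend.
    ssh -p 29418 admin@review02 replication start openstack/python-openstackclient --url gitea10
    # Full pass: replicate every repo to every backend.
    ssh -p 29418 admin@review02 replication start
    # Watch the task count drain back toward the ~20-task baseline.
    ssh -p 29418 admin@review02 gerrit show-queue -w | tail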
14:58 <clarkb> normal gitea actions should be safe with replication. We have a specific shutdown and startup process for gitea that is meant to allow gerrit to retry replication until successful. But if there is an OOM and things are killed in the wrong order etc that won't work
14:58 <fungi> baseline is typically around 20 tasks
14:58 <fungi> task count has dropped below 13k now
15:06 <clarkb> the underlying issue with replication and gitea shutdowns is that we need the web and db services to be up and running when pushes are made. That means the ssh service must start after those two are ready
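As a compose-style sketch of that ordering constraint (service names and the healthcheck are illustrative, not the actual opendev gitea deployment):

    # gitea-ssh only starts once web and db are available, so a push that
    # gerrit retries during startup never lands before its backing services.
    services:
      mariadb:
        image: mariadb
        healthcheck:
          test: ["CMD", "healthcheck.sh", "--connect"]
      gitea-web:
        image: gitea/gitea
        depends_on:
          mariadb:
            condition: service_healthy
      gitea-ssh:
        image: gitea/gitea
        depends_on:
          gitea-web:
            condition: service_started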
15:18 <clarkb> 2529 package updates waiting for me. Looks like python313 stuff got updated
15:18 <clarkb> oh and xfce
15:19 <fungi> that's a lot of packages
15:20 <clarkb> I wonder if the compiler abi got bumped
15:37 <clarkb> replication seems to have completed
15:40 <fungi> confirmed
15:49 <clarkb> I think nodepool builders (for x86) will be the next conversion to noble. The way we do containers in containers to build rocky seems like another good exercise of the podman transition
15:50 <clarkb> I believe it should be completely safe to spin up two new builders (nb05 and nb06) with the appropriate workspace volume attached, add them alongside nb01 and nb02, then shut down services on nb01 and nb02 with the hosts in the emergency file to force the new servers to build all the images. Maybe even make some explicit requests rather than waiting on the regular schedule
15:51 <fungi> yeah, either they'll successfully build images, or they won't
15:51 <fungi> worst case we end up with some stale images when they call dibs (pun intended) on a rebuild
15:57 <opendevreview> Clark Boylan proposed opendev/system-config master: WIP Test nodepool builders on Noble  https://review.opendev.org/c/opendev/system-config/+/944783
15:57 <clarkb> that's the first step in just making sure this works in CI. If that looks good I'll work on booting the two new servers next
16:18 <opendevreview> Karolina Kula proposed opendev/glean master: WIP: Add support for CentOS 10 keyfiles  https://review.opendev.org/c/opendev/glean/+/941672
16:29 <opendevreview> Clark Boylan proposed opendev/system-config master: WIP Test nodepool builders on Noble  https://review.opendev.org/c/opendev/system-config/+/944783
16:29 <clarkb> that is just a small test thing. Once that comes back happy the next step is booting new nodes
16:31 <clarkb> actually just found something else that needs correcting.
16:37 <opendevreview> Clark Boylan proposed opendev/system-config master: WIP Test nodepool builders on Noble  https://review.opendev.org/c/opendev/system-config/+/944783
16:37 <clarkb> ok I think that may be it
16:37 <clarkb> actually I should probably put that into a parent change so that we check focal and noble both
16:41 <opendevreview> Clark Boylan proposed opendev/system-config master: WIP Test nodepool builders on Noble  https://review.opendev.org/c/opendev/system-config/+/944783
16:41 <opendevreview> Clark Boylan proposed opendev/system-config master: Make nodepool image export forward/backward compatible  https://review.opendev.org/c/opendev/system-config/+/944788
16:43 <opendevreview> Tristan Cacqueray proposed zuul/zuul-jobs master: Update the set-zuul-log-path-fact scheme to prevent huge url  https://review.opendev.org/c/zuul/zuul-jobs/+/927582
16:55 <opendevreview> Clark Boylan proposed opendev/system-config master: Rebuild our base python container images  https://review.opendev.org/c/opendev/system-config/+/944789
17:03 <clarkb> note to myself (and anyone else who is curious I guess): the current x86 nodepool builders have 2x1TB volumes merged together via lvm into a 2TB ext4 fs mounted on /opt for the dib workspace area
17:06 <clarkb> I think I can use the launch node flags that end up running https://opendev.org/opendev/system-config/src/branch/master/launch/src/opendev_launch/mount_volume.sh against the first 1TB volume. Then I can manually add the second volume, format it, pvcreate on it, then vgextend and lvextend and finally resize the ext4 fs?
17:06 <clarkb> or should I not let launch node do any of it, so that the ext4 creation is aware of the total final size when allocating inodes etc?
17:06 <corvus> is 1x 2tb volume an option?
17:06 <clarkb> fungi: ^ do you know if extending the ext4 fs will allocate new resources for inodes etc?
17:07 <clarkb> corvus: no, rax classic limits volumes to 1TB
17:07 <fungi> clarkb: ext4 scales inode limits by block count, i want to say
17:08 <fungi> are we anywhere close to running out?
17:08 <clarkb> not on nb01
17:08 <clarkb> 5% inode usage with 37% disk usage
17:09 <clarkb> so maybe it's fine. I'm happy to do it all from scratch if we think that is a better end result though
17:09 <fungi> yeah, i wouldn't worry much about it for now
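The 17:06 sequence sketched end to end, run before the filesystem is mounted (device names and the vg/lv naming are assumptions; launch-node has already created the ext4 fs on the first volume):

    # Fold the second 1TB volume into the existing volume group, then grow the fs.
    pvcreate /dev/xvdd                    # initialize the new volume for LVM
    vgextend main /dev/xvdd               # add it to the volume group
    lvextend -l +100%FREE /dev/main/opt   # grow the LV over all free extents
    e2fsck -f /dev/main/opt               # resize2fs may insist on a clean check first
    resize2fs /dev/main/opt               # grow ext4; inode count scales with block count
    df -i /opt                            # confirm inode headroom once mounted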
17:22 <clarkb> ok those system-config changes to smoke test noble nodepool builder deployments are happy. Time to boot a new server
17:25 <corvus> +2s from me
17:31 <clarkb> the new server is booting. Once I've got nb05 sorted out I'll boot nb06 and then do a single change to add both to the inventory (updating 944783 to do that)
17:44 <clarkb> the /opt mount point for the ephemeral drive and the nodepool dib workspace volume collided but I think I can just remove the ephemeral drive mount point from fstab and call it good (seems this was done with nb01)
17:59 <clarkb> infra-root: 104.130.253.28 is the new nb05 server. I think it lgtm so I'm going to repeat what I did to build an nb06 now
18:04 <clarkb> resize2fs made me e2fsck first but otherwise I didn't hit anything unexpected
18:11 <fungi> huh, normally that doesn't happen, but maybe the volume had gone too long without a fsck
18:12 <clarkb> it was all brand new. I was wondering if it needed fsck data to resize and we didn't have any yet
18:13 <fungi> oh, huh, why did you need to resize2fs then if it was new?
18:14 <fungi> you started with a single volume and already had data in it, then wanted to add a second volume?
18:14 <clarkb> fungi: because launch node creates the fs on the first volume but doesn't know how to handle two volumes in one vg/lv. So I let launch node do everything with one then add the second volume and extend things
18:14 <clarkb> both volumes are newly created
18:15 <clarkb> this is what I was asking about above, whether I should do it all by hand. But it seems like this is working so I think it's ok?
18:15 <fungi> aha, another reason that volume-adding functionality in launch-node is probably not all that helpful (like where we carve them up into multiple logical volumes on mirror servers)
18:16 <clarkb> fwiw we ended up with more inodes on the nb05 filesystem when I was done than we had on nb01, so I think we're ok
18:16 <clarkb> not a ton more, but almost 10% if my napkin math is correct
18:16 <fungi> yeah, i expect nb01 has seen multiple increases to its volume group. might have started with a smaller single volume
18:17 <fungi> then later migrated the extents to a larger volume, then later still added another volume
18:19 <fungi> anyway, if we don't see other builders using anywhere near that many inodes, then no, i still wouldn't worry about "fixing" them for the sake of consistency or anything
18:20 <fungi> (i misunderstood earlier, thinking you were asking if we should redo the existing volumes on the old builders)
18:24 <clarkb> ya I was trying to figure out if it is better to mkfs once rather than mkfs and resize2fs
18:24 <clarkb> which still isn't clear to me, since we ended up with a better situation than the preexisting setup
18:25 <clarkb> nb06 is up at 104.130.127.175 with volumes sorted out. I'm working on dns and inventory changes now
18:29 <opendevreview> Clark Boylan proposed opendev/zone-opendev.org master: Add nb05 and nb06 to DNS  https://review.opendev.org/c/opendev/zone-opendev.org/+/944794
18:33 <opendevreview> Clark Boylan proposed opendev/system-config master: Deploy nb05 and nb06 Noble nodepool builders  https://review.opendev.org/c/opendev/system-config/+/944783
18:34 <clarkb> ok I think those two changes are ready to go. Feel free to jump onto those servers and double check what I did volume and lvm wise
18:36 <clarkb> now is a good time to remind people to edit the meeting agenda or let me know what updates you'd like to see
18:36 <clarkb> It is just about lunch time but I'll probably start on the meeting agenda afterwards
18:38 <fungi> huh, 944789 failed system-config-build-image-uwsgi-base-3.10-bookworm on a segfault during pip install uwsgi
18:41 <Clark[m]> Arg, we've seen that before and had hacks like single-threaded uwsgi builds to work around it. I wonder though if we need the 3.10 image anymore
18:41 <Clark[m]> I think lodgeit is the one thing we use depending on it
18:41 <Clark[m]> So maybe we need to simplify image builds first?
18:42 <fungi> do you recall why we install lodgeit on 3.10? lack of 3.11 support or just haven't gotten around to upgrading it?
18:43 <Clark[m]> Oh I meant lodgeit is the only thing using the uwsgi image. Looks like it moved to 3.11 already
18:43 <opendevreview> Merged opendev/zone-opendev.org master: Add nb05 and nb06 to DNS  https://review.opendev.org/c/opendev/zone-opendev.org/+/944794
18:44 <Clark[m]> So we can likely drop that image build. I wonder if we can do the same with python-builder and python-base yet
18:45 <fungi> yeah, the fewer jobs we need to run, the better all around
18:46 <Clark[m]> I can check after lunch
18:49 <Clark[m]> https://codesearch.opendev.org/?q=3.10-bookworm&i=nope&literal=nope&files=&excludeFiles=&repos= this quick check seems to show cleanup is safe.
19:50 <clarkb> fungi: you approved https://review.opendev.org/c/opendev/system-config/+/944783 but didn't review its parent. Can you do that too?
19:53 <opendevreview> Clark Boylan proposed opendev/system-config master: Rebuild our base python container images  https://review.opendev.org/c/opendev/system-config/+/944789
19:53 <opendevreview> Clark Boylan proposed opendev/system-config master: Drop python3.10 container image builds  https://review.opendev.org/c/opendev/system-config/+/944799
19:53 <clarkb> and that is the cleanup of 3.10 image builds to hopefully make things happier building for 3.11 and 2.12
19:53 <clarkb> *3.12
19:59 <opendevreview> Merged zuul/zuul-jobs master: Fix the upload-logs-s3 test playbook  https://review.opendev.org/c/zuul/zuul-jobs/+/927600
20:02 <opendevreview> Clark Boylan proposed zuul/zuul-jobs master: Pull minio from quay instead of docker hub  https://review.opendev.org/c/zuul/zuul-jobs/+/944801
20:12 <fungi> oh, thanks, i wondered why it wasn't enqueuing; i reviewed the depends-on but missed it had a parent commit as well
20:13 <fungi> lgtm
20:13 <clarkb> I'm going to drop the parallel infra-prod meeting agenda item. I think it is working well and I'm not sure I have anything to say other than that
20:14 <fungi> confirmed we install a /usr/local/bin/docker-compose as a wrapper around `docker compose` on noble
20:14 <clarkb> yup
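A wrapper like that can be as small as an exec shim; a guess at its shape rather than the actual system-config file:

    #!/bin/sh
    # Hypothetical /usr/local/bin/docker-compose: forward v1-style invocations
    # to the compose plugin so tooling that expects the old binary keeps working.
    exec docker compose "$@"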
20:14 <fungi> yeah, seems the parallel stuff is working well; the only reason i can think to keep it on the agenda is we cranked its mutex up to 4x
20:19 <clarkb> my first pass on meeting agenda edits is in
20:20 <clarkb> heh, now uwsgi 3.11 is failing
20:20 <clarkb> I'm going to look into what we did in the past to hack around its problems
20:25 <clarkb> I'm quickly beginning to wonder if we should consider using something other than uwsgi
20:25 <clarkb> but one step at a time
20:26 <opendevreview> Merged opendev/system-config master: Make nodepool image export forward/backward compatible  https://review.opendev.org/c/opendev/system-config/+/944788
20:28 <opendevreview> Merged zuul/zuul-jobs master: Pull minio from quay instead of docker hub  https://review.opendev.org/c/zuul/zuul-jobs/+/944801
20:31 <opendevreview> Clark Boylan proposed opendev/system-config master: Rebuild our base python container images  https://review.opendev.org/c/opendev/system-config/+/944789
20:32 <fungi> wow, increasing pip verbosity *fixed* builds in the past? that's a pure heisenbug
20:32 <clarkb> yes
20:33 <clarkb> we discovered this because we added it for more debugging and then it stopped failing. This was on bullseye not bookworm though, so it may be a noop here
20:33 <clarkb> but I figured starting with the old "uwsgi compiles are flaky" solution that adds more debugging info was a good start
20:33 <clarkb> but also it almost looks like gcc is segfaulting?
20:33 <clarkb> which is very interesting
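The old hack, in one line (that CPUCOUNT is the knob uwsgi's build script reads to limit parallel compilation is a recollection, not confirmed here; the verbose flag being load-bearing is the heisenbug part):

    # Single-threaded, verbose uwsgi build: the combination that has papered
    # over the flaky compiles before.
    CPUCOUNT=1 pip install -v uwsgi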
20:33 <fungi> i'm taking that as a hint that i need a break before my 7pm conference call
20:35 <clarkb> I'm going to propose a change to lodgeit that attempts to use https://pypi.org/project/granian/
20:35 <clarkb> I think if we switch to granian we don't need a dedicated image because they are good about publishing wheels for x86 and aarch64
20:37 <fungi> looks promising
20:45 <opendevreview> Merged opendev/system-config master: Deploy nb05 and nb06 Noble nodepool builders  https://review.opendev.org/c/opendev/system-config/+/944783
20:47 <opendevreview> Clark Boylan proposed opendev/lodgeit master: Run lodgeit with granian instead of uwsgi  https://review.opendev.org/c/opendev/lodgeit/+/944805
20:47 <clarkb> I don't know if ^ will work but it's a start
20:48 <clarkb> 944783 is deploying all the things as expected. Probably half an hour or so away from deploying the new nodepool servers?
21:00 <opendevreview> Clark Boylan proposed opendev/system-config master: Run lodgeit with granian  https://review.opendev.org/c/opendev/system-config/+/944806
21:01 <clarkb> I realized we don't have functional testing of lodgeit in lodgeit. ^ attempts to cover that gap
21:09 <frickler> https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/ :(
21:12 <opendevreview> Stephen Reaves proposed openstack/diskimage-builder master: Enable custom overlays  https://review.opendev.org/c/openstack/diskimage-builder/+/943500
21:14 <clarkb> nepenthes is really neat. But I'm thankful we haven't had to go that route yet
21:18 <clarkb> fungi: theory: what if uwsgi builds are failing because the python-builder image it is using is stale
21:20 <clarkb> hrm, no, we have dependencies and requires on the other images, so we should have the latest and greatest across the board
21:22 <clarkb> the nb05 and nb06 deployments completed and look ok at first glance. Once the hourlies finish for 21:00 I'll put nb01 and nb02 in the emergency file and shut down their builders
21:23 <clarkb> then we can request builds for things and see what happens
21:40 <opendevreview> James E. Blair proposed zuul/zuul-jobs master: Add upload-image-swift role  https://review.opendev.org/c/zuul/zuul-jobs/+/944812
21:40 <opendevreview> James E. Blair proposed zuul/zuul-jobs master: Add upload-image-s3 role  https://review.opendev.org/c/zuul/zuul-jobs/+/944813
21:50 <clarkb> I've rechecked 944806 with an autohold in place. Running the image locally I wasn't able to find anything that looked like the problem to me
21:50 <clarkb> but I didn't have a db running to connect to, and that is where it appears to be failing
21:51 <clarkb> now to shut down the old builders
21:52 <clarkb> this is done
21:52 <clarkb> I'll request an ubuntu noble image build and a rocky image build I guess
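Explicit build requests go through the nodepool CLI; a sketch, with the image names guessed from context:

    # Ask the new builders for fresh builds rather than waiting on the
    # regular rebuild schedule, then watch the build state.
    nodepool image-build ubuntu-noble
    nodepool image-build rockylinux-9
    nodepool dib-image-list | grep -E 'ubuntu-noble|rockylinux'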
21:55 <clarkb> corvus: question: does nodepool still get weird when the builder responsible for an image goes away?
21:56 <clarkb> corvus: I'm wondering because I just stopped the builders and they aren't running yet. I want to say we've corrected most of the issues with that, and what should happen is nodepool will keep those images around as long as we need them but will let other builders delete those images if those builders are gone and we don't need the image anymore?
21:57 <clarkb> I guess we'll find out :/
21:57 <opendevreview> James E. Blair proposed zuul/zuul-jobs master: Add upload-image-s3 role  https://review.opendev.org/c/zuul/zuul-jobs/+/944813
21:57 <clarkb> I guess the worst case may be that we have to turn nb01 and nb02 back on temporarily to let them clean up after themselves once we're happy with nb05 and nb06
21:58 <corvus> clarkb: tbh not sure off the top of my head
22:00 <corvus> clarkb: i think the uploads should get deleted; i'm not sure about the zk record for the on-disk build
22:01 <clarkb> corvus: but the uploads should only delete after we've uploaded new replacement images from the new servers, right? this won't be a case of accidental early cleanup
22:01 <clarkb> (fwiw I haven't seen images get deleted yet so I think that is the case)
22:01 <clarkb> and I can follow up with nodepool later once we have an image or two built by the new servers that cycle things
22:01 <corvus> oh yeah, i don't think there should be any accidental early cleanup
22:02 <corvus> i think there's a good chance the automatic cleanup should work too without needing to bring the old ones online
22:02 <clarkb> excellent, I'll monitor and see where we end up
22:02 <corvus> (i think it will delete all the uploads, then delete the zk build records when all upload records are gone; and only then would the old builders delete their local files, but we don't care about the last step)
22:02 <clarkb> ack
22:03 <corvus> but again, that's only once it's decided the old uploads should be expired due to new uploads; i don't expect it to start that process before then
22:03 <clarkb> cool, that seems to be in line with what I see so far (but we don't have any new images uploaded yet). Each new builder is building one image. They are both caching git repos now, which isn't super fast the first time
22:12 <opendevreview> James E. Blair proposed zuul/zuul-jobs master: Add upload-image-s3 role  https://review.opendev.org/c/zuul/zuul-jobs/+/944813
22:34 <clarkb> so the problem with lodgeit and granian is that granian expects to call the application(environ, start_response) wsgi interface, but we instead seem to have a factory for those interfaces, and I guess uwsgi is smart enough to call the factory instead?
22:34 <clarkb> anyway it's going to require some refactoring to get this to work I think
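The mismatch in concrete terms: a WSGI server loads a finished callable, while a factory returns one. A minimal shim module, with lodgeit's actual factory name and import path being assumptions here:

    # wsgi.py -- resolve the app factory once at import time so granian can
    # be pointed at "wsgi:application" like any plain WSGI callable.
    from lodgeit.application import make_app  # hypothetical import path

    application = make_app()

    # For reference, the shape granian calls per request:
    #   application(environ, start_response) -> iterable of bytes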
22:39 <opendevreview> Clark Boylan proposed opendev/lodgeit master: Run lodgeit with granian instead of uwsgi  https://review.opendev.org/c/opendev/lodgeit/+/944805
22:41 <opendevreview> Clark Boylan proposed opendev/lodgeit master: Run lodgeit with granian instead of uwsgi  https://review.opendev.org/c/opendev/lodgeit/+/944805
22:41 <opendevreview> Clark Boylan proposed opendev/system-config master: Run lodgeit with granian  https://review.opendev.org/c/opendev/system-config/+/944806
22:42 <clarkb> I think that may function now but I'm not sure that it is production quality yet. I think we need to memoize some stuff
22:49 <opendevreview> James E. Blair proposed zuul/zuul-jobs master: Add upload-image-s3 role  https://review.opendev.org/c/zuul/zuul-jobs/+/944813
22:50 <clarkb> last call on meeting agenda items. I've got a meeting and then I'm sending it out around 00:00 UTC
23:00 <clarkb> the new ubuntu noble image built and is uploading now
23:41 <clarkb> the rockylinux image built too, so these builders are looking happy so far
23:55 <clarkb> the latest patches for lodgeit under granian do seem to work. I think the next step there is making it more efficient, which I'll have to look into tomorrow
23:55 <clarkb> and now time to get the meeting agenda out
