ajay | hi | 03:03 |
---|---|---|
*** jroll02 is now known as jroll0 | 06:52 | |
opendevreview | Stephen Reaves proposed openstack/diskimage-builder master: Enable custom overlays https://review.opendev.org/c/openstack/diskimage-builder/+/943500 | 12:33 |
opendevreview | Stephen Reaves proposed openstack/diskimage-builder master: Enable custom overlays https://review.opendev.org/c/openstack/diskimage-builder/+/943500 | 12:44 |
frickler | infra-root: we seem to have some broken gitea replication. this gives me a 404, other instances seem fine https://gitea10.opendev.org:3081/openstack/python-openstackclient/src/tag/7.2.0/ see also https://zuul.opendev.org/t/openstack/build/4b097f5985b44527a4635ecc33ba70c9 | 14:22 |
fungi | interesting, python-openstackclient 7.2.0 was tagged 2024-10-18 | 14:25 |
fungi | so it's been missing from gitea10 for 5 months? | 14:25 |
fungi | and yeah, i confirmed by cloning from gitea10, the tag is missing in the git repo there, it's not just a database problem | 14:27 |
fungi | i can spot fix that by telling gerrit to re-replicate openstack/python-openstackclient to gitea10, but holding off in case anyone else wants to check on it first | 14:33 |
fungi | https://wiki.openstack.org/wiki/Infrastructure_Status doesn't mention any problems on the day that tag was pushed | 14:34 |
fungi | https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2024-10-18.log.html indicates there was a gitea upgrade that merged at 17:27 utc that day | 14:35 |
fungi | though the tag was created at 10:37:01 utc so nearly a 7-hour gap before the gitea upgrade would have occurred | 14:36 |
fungi | we also did a gerrit restart later that day around 18:40 utc | 14:38 |
fungi | for an upgrade | 14:39 |
clarkb | it's possible the gitea instance had OOM'd | 14:48 |
clarkb | we were fighting that issue until very recently | 14:48 |
fungi | ooh, good point | 14:48 |
fungi | unfortunately dmesg only goes back as far as 2024-11-25 | 14:49 |
fungi | and we only have a month's retention for syslog | 14:49 |
clarkb | if the other giteas have the tag then I'd say that is the most likely cause | 14:49 |
clarkb | if the problem affects all giteas then something else is likely at play | 14:50 |
fungi | yeah, i concur | 14:50 |
fungi | should i go ahead and force replication for repo/backend? | 14:50 |
clarkb | I think so | 14:51 |
clarkb | I would do the whole backend as other events may have been lost too | 14:51 |
fungi | all of gitea10 you mean? | 14:52 |
clarkb | yes | 14:52 |
frickler | I'd say yes, but this also makes me wonder how many similar issues we might be having that are still unnoticed | 14:52 |
clarkb | `ssh -p 29418 admin@review02 replication start --url gitea10` iirc | 14:52 |
fungi | frickler: you mean on the other backends? | 14:52 |
clarkb | we can replicate to all of the backends. I just did 09 when I fixed the port 3000 stuff | 14:53 |
frickler | maybe, yes. but I also wonder how this wasn't fixed when later tags were replicated for the same repo | 14:53 |
fungi | i think only new refs get replicated | 14:53 |
fungi | gerrit doesn't normally do a full mirror operation with every event | 14:54 |
clarkb | fungi: correct | 14:54 |
fungi | tag objects themselves aren't parents of future commits | 14:54 |
clarkb | the pushes are for specific refs. Only events updating those refs cause them to be pushed. Unless you do a full replication request via the ssh api or similar | 14:54 |
frickler | so we might want to do a full mirror periodically just to be sure? like weekly maybe? | 14:54 |
clarkb | I don't know that that is necessary now that gitea is fairly stable. But if we want to be extra sure that would probably be fine too | 14:55 |
frickler | ok, so just doing it once for all hosts then might be good enough | 14:56 |
fungi | for a spot check that replication will fix this, i did a `replication start openstack/python-openstackclient --url=gitea10` and https://gitea10.opendev.org:3081/openstack/python-openstackclient/src/tag/7.2.0/ is returning content now | 14:56 |
fungi | i'll rerun without the repo and url patterns now to replicate everything to all backends | 14:56 |
clarkb | ack | 14:56 |
fungi | just a `replication start` in other words | 14:56 |
frickler | the other thing would be to inform the release team to hold releases not only for gerrit restarts but also for gitea actions? | 14:57 |
fungi | tailing the end of `gerrit show-queue` will give you a count of remaining tasks, most of which will be the replication | 14:57 |
clarkb | normal gitea actions should be safe with replication. We have a specific shutdown and startup process for gitea that is meant to allow gerrit to retry replication until successful. But if there is an OOM and things are killed in the wrong order etc that won't work | 14:58 |
fungi | baseline is typically around 20 tasks | 14:58 |
fungi | task count has dropped below 13k now | 14:58 |
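A minimal sketch of the commands discussed above, using the admin account, host, and port quoted in the log; exact replication plugin flags can vary by Gerrit version:

```shell
# spot fix: re-replicate one repo to one backend
ssh -p 29418 admin@review02 replication start openstack/python-openstackclient --url gitea10

# full pass: re-replicate everything to all backends
ssh -p 29418 admin@review02 replication start

# watch the task queue drain; the summary at the end includes the task count
ssh -p 29418 admin@review02 gerrit show-queue | tail
```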
clarkb | the underlying issue with replication and gitea shutdowns is that we need the web and db services to be up and running when pushes are made. That means the ssh service must start after those two are ready | 15:06 |
clarkb | 2529 package updates waiting for me. Looks like python313 stuff got updated | 15:18 |
clarkb | oh and xfce | 15:18 |
fungi | that's a lot of packages | 15:19 |
clarkb | I wonder if the compiler abi got bumped | 15:20 |
clarkb | replication seems to have completed | 15:37 |
fungi | confirmed | 15:40 |
clarkb | I think nodepool builders (for x86) will be the next conversion to noble. The way we do containers in containers to build rocky seems like another good exercise of the podman transition | 15:49 |
clarkb | I believe it should be completely safe to spin up two new builders (nb05 and nb06) with the appropriate workspace volume attached, add them alongside nb01 and nb02, then shut down services on nb01 and nb02 with the hosts in the emergency file to force the new servers to build all the images. Maybe even make some explicit requests rather than waiting on the regular schedule | 15:50 |
fungi | yeah, either they'll successfully build images, or they won't | 15:51 |
fungi | worst case we end up with some stale images when they call dibs (pun intended) on a rebuild | 15:51 |
opendevreview | Clark Boylan proposed opendev/system-config master: WIP Test nodepool builders on Noble https://review.opendev.org/c/opendev/system-config/+/944783 | 15:57 |
clarkb | that's the first step in just making sure this works in CI. If that looks good I'll work on booting the two new servers next | 15:57 |
opendevreview | Karolina Kula proposed opendev/glean master: WIP: Add support for CentOS 10 keyfiles https://review.opendev.org/c/opendev/glean/+/941672 | 16:18 |
opendevreview | Clark Boylan proposed opendev/system-config master: WIP Test nodepool builders on Noble https://review.opendev.org/c/opendev/system-config/+/944783 | 16:29 |
clarkb | that is just a small test thing. Once that comes back happy next step is booting new nodes | 16:29 |
clarkb | actually just found something else that needs correcting. | 16:31 |
opendevreview | Clark Boylan proposed opendev/system-config master: WIP Test nodepool builders on Noble https://review.opendev.org/c/opendev/system-config/+/944783 | 16:37 |
clarkb | ok I think that may be it | 16:37 |
clarkb | actually I should probably put that into a parent change so that we check focal and noble both | 16:37 |
opendevreview | Clark Boylan proposed opendev/system-config master: WIP Test nodepool builders on Noble https://review.opendev.org/c/opendev/system-config/+/944783 | 16:41 |
opendevreview | Clark Boylan proposed opendev/system-config master: Make nodepool image export forward/backward compatible https://review.opendev.org/c/opendev/system-config/+/944788 | 16:41 |
opendevreview | Tristan Cacqueray proposed zuul/zuul-jobs master: Update the set-zuul-log-path-fact scheme to prevent huge url https://review.opendev.org/c/zuul/zuul-jobs/+/927582 | 16:43 |
opendevreview | Clark Boylan proposed opendev/system-config master: Rebuild our base python container images https://review.opendev.org/c/opendev/system-config/+/944789 | 16:55 |
clarkb | note to myself (and anyone else who is curious I guess): the current x86 nodepool builders have 2x1TB volumes merged together via lvm into a 2TB ext4 fs mounted on /opt for dib workspace area | 17:03 |
clarkb | I think I can use the launch node flags that end up running https://opendev.org/opendev/system-config/src/branch/master/launch/src/opendev_launch/mount_volume.sh against the first 1TB volume. Then I can manually add the second volume, format it, pvcreate on it, then vgextend and lvextend and finally resize the ext4 fs? | 17:06 |
clarkb | or should I not let launch node do any of it so that the ext4 creation is aware of the total final size when allocating inodes etc? | 17:06 |
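A sketch of the manual steps being described, with hypothetical device, volume group, and logical volume names (/dev/xvdc, vg "main", lv "opt"); the real names on the builders may differ:

```shell
# launch-node has already created the VG/LV/ext4 on the first 1TB volume;
# the second 1TB volume is then added by hand:
pvcreate /dev/xvdc                 # hypothetical second cinder volume
vgextend main /dev/xvdc            # hypothetical VG name
lvextend -l +100%FREE /dev/main/opt
resize2fs /dev/main/opt            # grow the ext4 fs to fill the extended LV
```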
corvus | is 1x 2tb volume an option? | 17:06 |
clarkb | fungi: ^ do you know if extending the ext4 fs will allocate new resources for inodes etc? | 17:06 |
clarkb | corvus: no rax classic limits volumes to 1TB | 17:07 |
fungi | clarkb: ext4 scales inode limits by block count, i want to say | 17:07 |
fungi | are we anywhere close to running out? | 17:08 |
clarkb | not on nb01 | 17:08 |
clarkb | 5% inode usage with 37% disk usage | 17:08 |
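For reference, block and inode usage can be checked separately, e.g.:

```shell
df -h /opt    # block (disk space) usage
df -i /opt    # inode usage
```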
clarkb | so maybe it's fine. I'm happy to do it all from scratch if we think that is a better end result though | 17:09 |
fungi | yeah, i wouldn't worry much about it for now | 17:09 |
clarkb | ok those system-config changes to smoke test noble nodepool builder deployments are happy. Time to boot a new server | 17:22 |
corvus | +2s from me | 17:25 |
clarkb | the new server is booting. Once I've got nb05 sorted out I'll boot nb06 and then do a single change to add both to the inventory (updating 944783 to do that) | 17:31 |
clarkb | the /opt mount point for the ephemeral drive and the nodepool dib workspace volume collided but I think I can just remove the ephemeral drive mount point from fstab and call it good (seems this was done with nb01) | 17:44 |
clarkb | infra-root 104.130.253.28 is the new nb05 server. I think it lgtm so I'm going to repeat what I did to build an nb06 now | 17:59 |
clarkb | resize2fs made me e2fsck first but otherwise I didn't hit anything unexpected | 18:04 |
fungi | huh, normally that doesn't happen, but maybe the volume had gone too long without a fsck | 18:11 |
clarkb | it was all brand new. I was wondering if it needed fsck data to resize and we didn't have any yet | 18:12 |
fungi | oh, huh, why did you need to resize2fs then if it was new? | 18:13 |
fungi | you started with a single volume and already had data in it, then wanted to add a second volume? | 18:14 |
clarkb | fungi: because launch node creates the fs on the first volume but doesn't know how to handle two volumes in one vg/lv. So I let launch node do everything with one then add the second volume and extend things | 18:14 |
clarkb | both volumes are newly created | 18:14 |
clarkb | this is what I was asking about above, if I should do it all by hand. But it seems like this is working so I think it's ok? | 18:15 |
fungi | aha, another reason that volume-adding functionality in launch-node is probably not all that helpful (like where we carve them up into multiple logical volumes on mirror servers) | 18:15 |
clarkb | fwiw we ended up with more inodes on the nb05 filesystem when I was done than we had on nb01 so I think we're ok | 18:16 |
clarkb | not a ton more but almost 10% if my napkin math is correct | 18:16 |
fungi | yeah, i expect nb01 has seen multiple increases to its volume group. might have started with a smaller single volume | 18:16 |
fungi | then later migrated the extents to a larger volume, then later still added another volume | 18:17 |
fungi | anyway, if we don't see other builders using anywhere near that many inodes, then no i still wouldn't worry about "fixing" them for the sake of consistency or anything | 18:19 |
fungi | (i misunderstood earlier thinking you were asking if we should redo the existing volumes on the old builders) | 18:20 |
clarkb | ya I was trying to figure out if it is better to mkfs once rather than mkfs and resize2fs | 18:24 |
clarkb | which still isn't clear to me since we ended up with a better situation than the preexisting setup | 18:24 |
clarkb | nb06 is up at 104.130.127.175 with volumes sorted out. I'm working on dns and inventory changes now | 18:25 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Add nb05 and nb06 to DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/944794 | 18:29 |
opendevreview | Clark Boylan proposed opendev/system-config master: Deploy nb05 and nb06 Noble nodepool builders https://review.opendev.org/c/opendev/system-config/+/944783 | 18:33 |
clarkb | ok I think those two changes are ready to go. Feel free to jump onto those servers and double check what I did volume and lvm wise | 18:34 |
clarkb | now is a good time to remind people to edit the meeting agenda or let me know what updates you'd like to see | 18:36 |
clarkb | It is just about lunch time but I'll probably start on the meeting agenda afterwards | 18:36 |
fungi | huh, 944789 failed system-config-build-image-uwsgi-base-3.10-bookworm on a segfault during pip install uwsgi | 18:38 |
Clark[m] | Arg we've seen that before and had hacks like single thread uwsgi builds to work around it. I wonder though if we need the 3.10 image anymore | 18:41 |
Clark[m] | I think lodgeit is the one thing we use depending on it | 18:41 |
Clark[m] | So maybe we need to simplify image builds first? | 18:41 |
fungi | do you recall why we install lodgeit on 3.10? lack of 3.11 support or just haven't gotten around to upgrading it? | 18:42 |
Clark[m] | Oh I meant lodgeit is the only thing using the uwsgi image. Looks like it moved to 3.11 already | 18:43 |
opendevreview | Merged opendev/zone-opendev.org master: Add nb05 and nb06 to DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/944794 | 18:43 |
Clark[m] | So we can likely drop that image build. I wonder if we can do the same with python-builder and python-base yet | 18:44 |
fungi | yeah, the fewer jobs we need to run, the better all around | 18:45 |
Clark[m] | I can check after lunch | 18:46 |
Clark[m] | https://codesearch.opendev.org/?q=3.10-bookworm&i=nope&literal=nope&files=&excludeFiles=&repos= this quick check seems to show cleanup is safe. | 18:49 |
clarkb | fungi: you approved https://review.opendev.org/c/opendev/system-config/+/944783 but didn't review its parent. Can you do that too? | 19:50 |
opendevreview | Clark Boylan proposed opendev/system-config master: Rebuild our base python container images https://review.opendev.org/c/opendev/system-config/+/944789 | 19:53 |
opendevreview | Clark Boylan proposed opendev/system-config master: Drop python3.10 container image builds https://review.opendev.org/c/opendev/system-config/+/944799 | 19:53 |
clarkb | and that is the cleanup of 3.10 image builds to hopefully make things happier building for 3.11 and 2.12 | 19:53 |
clarkb | *3.12 | 19:53 |
opendevreview | Merged zuul/zuul-jobs master: Fix the upload-logs-s3 test playbook https://review.opendev.org/c/zuul/zuul-jobs/+/927600 | 19:59 |
opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Pull minio from quay instead of docker hub https://review.opendev.org/c/zuul/zuul-jobs/+/944801 | 20:02 |
fungi | oh, thanks, i wondered why it wasn't enqueuing, i reviewed the depends-on but missed it had a parent commit as well | 20:12 |
fungi | lgtm | 20:13 |
clarkb | I'm going to drop the parallel infra-prod meeting agenda item. I think it is working well and not sure I have anything to say other than that | 20:13 |
fungi | confirmed we install a /usr/local/bin/docker-compose as a wrapper around `docker compose` on noble | 20:14 |
clarkb | yup | 20:14 |
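A minimal sketch of what such a wrapper typically looks like; the actual script installed by system-config may differ:

```shell
#!/bin/sh
# forward legacy docker-compose invocations to the compose v2 plugin
exec docker compose "$@"
```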
fungi | yeah, seems the parallel stuff is working well, only reason i can think to keep it on the agenda is we cranked its mutex up to 4x | 20:14 |
clarkb | my first pass on meeting agenda edits is in | 20:19 |
clarkb | heh now uwsgi 3.11 is failing | 20:20 |
clarkb | I'm going to look into what we did in the past to hack around its problems | 20:20 |
clarkb | I'm quickly beginning to wonder if we should consider using something other than uwsgi | 20:25 |
clarkb | but one step at a time | 20:25 |
opendevreview | Merged opendev/system-config master: Make nodepool image export forward/backward compatible https://review.opendev.org/c/opendev/system-config/+/944788 | 20:26 |
opendevreview | Merged zuul/zuul-jobs master: Pull minio from quay instead of docker hub https://review.opendev.org/c/zuul/zuul-jobs/+/944801 | 20:28 |
opendevreview | Clark Boylan proposed opendev/system-config master: Rebuild our base python container images https://review.opendev.org/c/opendev/system-config/+/944789 | 20:31 |
fungi | wow, increasing pip verbosity *fixed* builds in the past? that's pure heisenbug | 20:32 |
clarkb | yes | 20:32 |
clarkb | we discovered this because we added it for more debugging, then it stopped failing. This was on bullseye not bookworm though so may be a noop here | 20:33 |
clarkb | but I figured starting with the old "uwsgi compiles are flaky" solution that adds more debugging info was a good start | 20:33 |
clarkb | but also it almost looks like gcc is segfaulting? | 20:33 |
clarkb | which is very interesting | 20:33 |
fungi | i'm taking that as a hint that i need a break before my 7pm conference call | 20:33 |
clarkb | I'm going to propose a change to lodgeit that attempts to use https://pypi.org/project/granian/ | 20:35 |
clarkb | I think if we switch to granian we don't need a dedicated image because they are good about publishing wheels for x86 and aarch64 | 20:35 |
fungi | looks promising | 20:37 |
opendevreview | Merged opendev/system-config master: Deploy nb05 and nb06 Noble nodepool builders https://review.opendev.org/c/opendev/system-config/+/944783 | 20:45 |
opendevreview | Clark Boylan proposed opendev/lodgeit master: Run lodgeit with granian instead of uwsgi https://review.opendev.org/c/opendev/lodgeit/+/944805 | 20:47 |
clarkb | I don't know if ^ will work but its a start | 20:47 |
clarkb | 944783 is deploying all the things as expected. Probably half an hour or so away from deploying the new nodepool servers? | 20:48 |
opendevreview | Clark Boylan proposed opendev/system-config master: Run lodgeit with granian https://review.opendev.org/c/opendev/system-config/+/944806 | 21:00 |
clarkb | I realized we don't have functional testing of lodgeit in lodgeit. ^ attempts to cover that gap | 21:01 |
frickler | https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/ :( | 21:09 |
opendevreview | Stephen Reaves proposed openstack/diskimage-builder master: Enable custom overlays https://review.opendev.org/c/openstack/diskimage-builder/+/943500 | 21:12 |
clarkb | nepenthes is really neat. But I'm thankful we haven't had to go that route yet | 21:14 |
clarkb | fungi: theory: what if uwsgi builds are failing because the python-builder image it is using is stale | 21:18 |
clarkb | hrm no we have dependencies and requires on the other images so we should have latest and greatest across the board | 21:20 |
clarkb | the nb05 and nb06 deployments completed and look ok at first glance. Once the hourlies finish for 21:00 I'll put nb01 and nb02 in the emergency file and shutdown their builders | 21:22 |
clarkb | then we can request builds for things and see what happens | 21:23 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Add upload-image-swift role https://review.opendev.org/c/zuul/zuul-jobs/+/944812 | 21:40 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Add upload-image-s3 role https://review.opendev.org/c/zuul/zuul-jobs/+/944813 | 21:40 |
clarkb | I've rechecked 944806 with an autohold in place. Running the image locally I wasn't able to find anything that looked like the problem to me | 21:50 |
clarkb | but I didn't have a db running to connect to and that is where it appears to be failing | 21:50 |
clarkb | now to shutdown the old builders | 21:51 |
clarkb | this is done | 21:52 |
clarkb | I'll request an ubuntu noble image build and a rocky image build I guess | 21:52 |
clarkb | corvus: question: does nodepool still get weird when the builder responsible for an image goes away? | 21:55 |
clarkb | corvus: I'm wondering because I just stopped the builders and they aren't running yet. I want to say we've corrected most of the issues with that and what should happen is nodepool will keep those images around as long as we need them but will let other builders delete those images if those builders are gone and we don't need the image anymore? | 21:56 |
clarkb | I guess we'll find out :/ | 21:57 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Add upload-image-s3 role https://review.opendev.org/c/zuul/zuul-jobs/+/944813 | 21:57 |
clarkb | I guess the worst case may be that we have to turn nb01 and nb02 back on temporarily to let them cleanup after themselves once we're happy with nb05 and nb06 | 21:57 |
corvus | clarkb: tbh not sure off the top of my head | 21:58 |
corvus | clarkb: i think the uploads should get deleted; i'm not sure about the zk record for the on-disk build | 22:00 |
clarkb | corvus: but the uploads should only delete after we've uploaded new replacement images from the new servers right? this won't be a case of accidental early cleanup | 22:01 |
clarkb | (fwiw I haven't seen images get deleted yet so I think that is the case) | 22:01 |
clarkb | and I can follow up with nodepool later once we have an image or two built by the new servers that cycle things | 22:01 |
corvus | oh yeah, i don't think there should be any accidental early cleanup | 22:01 |
corvus | i think there's a good chance the automatic cleanup should work too without needing to bring the old ones online | 22:02 |
clarkb | excellent I'll monitor and see where we end up | 22:02 |
corvus | (i think it will delete all the uploads, then delete the zk build records when all upload records are gone; and only then would the old builders delete their local files, but we don't care about the last step) | 22:02 |
clarkb | ack | 22:02 |
corvus | but again, that's only once it's decided the old uploads should be expired due to new uploads; i don't expect it to start that process before then | 22:03 |
clarkb | cool that seems to be in line with what I see so far (but we don't have any new images uploaded yet). Each new builder is building one image. They are both caching git repos now which isn't super fast the first time | 22:03 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Add upload-image-s3 role https://review.opendev.org/c/zuul/zuul-jobs/+/944813 | 22:12 |
clarkb | so the problem with lodgeit and granian is that granian expects to call the application(environ, start_response) wsgi interface but we instead seem to have a factory that builds the application, and I guess uwsgi is smart enough to call the factory instead? | 22:34 |
clarkb | anyway it's going to require some refactoring to get this to work I think | 22:34 |
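Roughly the distinction being described, with purely illustrative names rather than lodgeit's actual module layout: a factory returns a WSGI app but is not itself one, so a server that only speaks plain WSGI needs the built app exposed as a module-level callable.

```python
class PasteApp:
    def __call__(self, environ, start_response):
        # the plain WSGI interface every WSGI server ultimately calls
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"ok"]

def make_app(dburi):
    # application factory: builds and returns a WSGI app, but is not
    # itself callable as application(environ, start_response)
    return PasteApp()

# what a plain-WSGI server would be pointed at:
application = make_app("sqlite://")  # hypothetical config value
```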
opendevreview | Clark Boylan proposed opendev/lodgeit master: Run lodgeit with granian instead of uwsgi https://review.opendev.org/c/opendev/lodgeit/+/944805 | 22:39 |
opendevreview | Clark Boylan proposed opendev/lodgeit master: Run lodgeit with granian instead of uwsgi https://review.opendev.org/c/opendev/lodgeit/+/944805 | 22:41 |
opendevreview | Clark Boylan proposed opendev/system-config master: Run lodgeit with granian https://review.opendev.org/c/opendev/system-config/+/944806 | 22:41 |
clarkb | I think that may function now but I'm not sure that it is production quality yet. I think we need to memoize some stuff | 22:42 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Add upload-image-s3 role https://review.opendev.org/c/zuul/zuul-jobs/+/944813 | 22:49 |
clarkb | last call on meeting agenda items. I've got a meeting and then I'm sending it out around 00:00 UTC | 22:50 |
clarkb | the new ubuntu noble image built and is uploading now | 23:00 |
clarkb | rockylinux image built too so these builders are looking happy so far | 23:41 |
clarkb | the latest patches for lodgeit under granian do seem to work. I think the next step there is making it more efficient which I'll have to look into tomorrow | 23:55 |
clarkb | and now time to get the meeting agenda out | 23:55 |