ajay | hi | 03:03 |
---|---|---|
*** jroll02 is now known as jroll0 | 06:52 | |
opendevreview | Stephen Reaves proposed openstack/diskimage-builder master: Enable custom overlays https://review.opendev.org/c/openstack/diskimage-builder/+/943500 | 12:33 |
opendevreview | Stephen Reaves proposed openstack/diskimage-builder master: Enable custom overlays https://review.opendev.org/c/openstack/diskimage-builder/+/943500 | 12:44 |
frickler | infra-root: we seem to have some broken gitea replication. this gives me a 404, other instances seem fine https://gitea10.opendev.org:3081/openstack/python-openstackclient/src/tag/7.2.0/ see also https://zuul.opendev.org/t/openstack/build/4b097f5985b44527a4635ecc33ba70c9 | 14:22 |
fungi | interesting, python-openstackclient 7.2.0 was tagged 2024-10-18 | 14:25 |
fungi | so it's been missing from gitea10 for 5 months? | 14:25 |
fungi | and yeah, i confirmed by cloning from gitea10, the tag is missing in the git repo there, it's not just a database problem | 14:27 |
fungi | i can spot fix that by telling gerrit to re-replicate openstack/python-openstackclient to gitea10, but holding off in case anyone else wants to check on it first | 14:33 |
fungi | https://wiki.openstack.org/wiki/Infrastructure_Status doesn't mention any problems on the day that tag was pushed | 14:34 |
fungi | https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2024-10-18.log.html indicates there was a gitea upgrade that merged at 17:27 utc that day | 14:35 |
fungi | though the tag was created at 10:37:01 utc so nearly a 7-hour gap before the gitea upgrade would have occurred | 14:36 |
fungi | we also did a gerrit restart later that day around 18:40 utc | 14:38 |
fungi | for an upgrade | 14:39 |
clarkb | it's possible the gitea instance had OOM'd | 14:48 |
clarkb | we were fighting that issue until very recently | 14:48 |
fungi | ooh, good point | 14:48 |
fungi | unfortunately dmesg only goes back as far as 2024-11-25 | 14:49 |
fungi | and we only have a month's retention for syslog | 14:49 |
clarkb | if the other giteas have the tag then I'd say that is the most likely cause | 14:49 |
clarkb | if the problem affects all giteas then something else is likely at play | 14:50 |
fungi | yeah, i concur | 14:50 |
fungi | should i go ahead and force replication for repo/backend? | 14:50 |
clarkb | I think so | 14:51 |
clarkb | I would do the whole backend as other events may have been lost too | 14:51 |
fungi | all of gitea10 you mean? | 14:52 |
clarkb | yes | 14:52 |
frickler | I'd say yes, but this also makes me wonder how many similar issues we might be having that are still unnoticed | 14:52 |
clarkb | `ssh -p 29418 admin@review02 replication start --url gitea10` iirc | 14:52 |
fungi | frickler: you mean on the other backends? | 14:52 |
clarkb | we can replicate to all of the backends. I just did 09 when I fixed the port 3000 stuff | 14:53 |
frickler | maybe, yes. but I also wonder how this wasn't fixed when later tags were replicated for the same repo | 14:53 |
fungi | i think only new refs get replicated | 14:53 |
fungi | gerrit doesn't normally do a full mirror operation with every event | 14:54 |
clarkb | fungi: correct | 14:54 |
fungi | tag objects themselves aren't parents of future commits | 14:54 |
clarkb | the pushes are for specific refs. Only events updating those refs cause them to be pushed. Unless you do a full replication request via the ssh api or similar | 14:54 |
frickler | so we might want to do a full mirror periodically just to be sure? like weekly maybe? | 14:54 |
clarkb | I don't know that that is necessary now that gitea is fairly stable. But if we want to be extra sure that would probably be fine too | 14:55 |
frickler | ok, so just doing it once for all hosts then might be good enough | 14:56 |
fungi | for a spot check that replication will fix this, i did a `replication start openstack/python-openstackclient --url=gitea10` and https://gitea10.opendev.org:3081/openstack/python-openstackclient/src/tag/7.2.0/ is returning content now | 14:56 |
fungi | i'll rerun without the repo and url patterns now to replicate everything to all backends | 14:56 |
clarkb | ack | 14:56 |
fungi | just a `replication start` in other words | 14:56 |
frickler | the other thing would be to inform the release team to hold releases not only for gerrit restarts but also for gitea actions? | 14:57 |
fungi | tailing the end of `gerrit show-queue` will give you a count of remaining tasks, most of which will be the replication | 14:57 |
clarkb | normal gitea actions should be safe with replication. We have a specific shutdown and startup process for gitea that is meant to allow gerrit to retry replication until successful. But if there is an OOM and things are killed in the wrong order etc that won't work | 14:58 |
fungi | baseline is typically around 20 tasks | 14:58 |
fungi | task count has dropped below 13k now | 14:58 |
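A minimal sketch of the commands discussed above, using the admin account, host, and port quoted in the log; exact replication plugin flags can vary by Gerrit version:

```shell
# spot fix: re-replicate one repo to one backend
ssh -p 29418 admin@review02 replication start openstack/python-openstackclient --url gitea10

# full pass: re-replicate everything to all backends
ssh -p 29418 admin@review02 replication start

# watch the task queue drain; the summary at the end includes the task count
ssh -p 29418 admin@review02 gerrit show-queue | tail
```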
clarkb | the underlying issue with replication and gitea shutdowns is that we need the web and db services to be up and running when pushes are made. That means the ssh service must start after those two are ready | 15:06 |
clarkb | 2529 package updates waiting for me. Looks like python313 stuff got updated | 15:18 |
clarkb | oh and xfce | 15:18 |
fungi | that's a lot of packages | 15:19 |
clarkb | I wonder if the compiler abi got bumped | 15:20 |
clarkb | replication seems to have completed | 15:37 |
fungi | confirmed | 15:40 |
clarkb | I think nodepool builders (for x86) will be the next conversion to noble. The way we do containers in containers to build rocky seems like another good exercise of the podman transition | 15:49 |
clarkb | I believe it should be completely safe to spin up two new builders (nb05 and nb06) with the appropriate workspace volume attached, add them alongside nb01 and nb02, then shut down services on nb01 and nb02 with the hosts in the emergency file to force the new servers to build all the images. Maybe even make some explicit requests rather than waiting on the regular schedule | 15:50 |
fungi | yeah, either they'll successfully build images, or they won't | 15:51 |
fungi | worst case we end up with some stale images when they call dibs (pun intended) on a rebuild | 15:51 |
opendevreview | Clark Boylan proposed opendev/system-config master: WIP Test nodepool builders on Noble https://review.opendev.org/c/opendev/system-config/+/944783 | 15:57 |
clarkb | that's the first step in just making sure this works in CI. If that looks good I'll work on booting the two new servers next | 15:57 |
opendevreview | Karolina Kula proposed opendev/glean master: WIP: Add support for CentOS 10 keyfiles https://review.opendev.org/c/opendev/glean/+/941672 | 16:18 |
opendevreview | Clark Boylan proposed opendev/system-config master: WIP Test nodepool builders on Noble https://review.opendev.org/c/opendev/system-config/+/944783 | 16:29 |
clarkb | that is just a small test thing. Once that comes back happy next step is booting new nodes | 16:29 |
clarkb | actually just found something else that needs correcting. | 16:31 |
opendevreview | Clark Boylan proposed opendev/system-config master: WIP Test nodepool builders on Noble https://review.opendev.org/c/opendev/system-config/+/944783 | 16:37 |
clarkb | ok I think that may be it | 16:37 |
clarkb | actually I should probably put that into a parent change so that we check focal and noble both | 16:37 |
opendevreview | Clark Boylan proposed opendev/system-config master: WIP Test nodepool builders on Noble https://review.opendev.org/c/opendev/system-config/+/944783 | 16:41 |
opendevreview | Clark Boylan proposed opendev/system-config master: Make nodepool image export forward/backward compatible https://review.opendev.org/c/opendev/system-config/+/944788 | 16:41 |
opendevreview | Tristan Cacqueray proposed zuul/zuul-jobs master: Update the set-zuul-log-path-fact scheme to prevent huge url https://review.opendev.org/c/zuul/zuul-jobs/+/927582 | 16:43 |
opendevreview | Clark Boylan proposed opendev/system-config master: Rebuild our base python container images https://review.opendev.org/c/opendev/system-config/+/944789 | 16:55 |
clarkb | note to myself (and anyone else who is curious I guess): the current x86 nodepool builders have 2x1TB volumes merged together via lvm into a 2TB ext4 fs mounted on /opt for dib workspace area | 17:03 |
clarkb | I think I can use the launch node flags that end up running https://opendev.org/opendev/system-config/src/branch/master/launch/src/opendev_launch/mount_volume.sh against the first 1TB volume. Then I can manually add the second volume, format it, pvcreate on it, then vgextend and lvextend and finally resize the ext4 fs? | 17:06 |
clarkb | or should I not let launch node do any of it so that the ext4 creation is aware of the total final size when allocating inodes etc? | 17:06 |
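A sketch of the manual steps being described, with hypothetical device, volume group, and logical volume names (/dev/xvdc, vg "main", lv "opt"); the real names on the builders may differ:

```shell
# launch-node has already created the VG/LV/ext4 on the first 1TB volume;
# the second 1TB volume is then added by hand:
pvcreate /dev/xvdc                 # hypothetical second cinder volume
vgextend main /dev/xvdc            # hypothetical VG name
lvextend -l +100%FREE /dev/main/opt
resize2fs /dev/main/opt            # grow the ext4 fs to fill the extended LV
```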
corvus | is 1x 2tb volume an option? | 17:06 |
clarkb | fungi: ^ do you know if extending the ext4 fs will allocate new resources for inodes etc? | 17:06 |
clarkb | corvus: no rax classic limits volumes to 1TB | 17:07 |
fungi | clarkb: ext4 scales inode limits by block count, i want to say | 17:07 |
fungi | are we anywhere close to running out? | 17:08 |
clarkb | not on nb01 | 17:08 |
clarkb | 5% inode usage with 37% disk usage | 17:08 |
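For reference, block and inode usage can be checked separately, e.g.:

```shell
df -h /opt    # block (disk space) usage
df -i /opt    # inode usage
```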
clarkb | so maybe it's fine. I'm happy to do it all from scratch if we think that is a better end result though | 17:09 |
fungi | yeah, i wouldn't worry much about it for now | 17:09 |
clarkb | ok those system-config changes to smoke test noble nodepool builder deployments are happy. Time to boot a new server | 17:22 |
corvus | +2s from me | 17:25 |
clarkb | the new server is booting. Once I've got nb05 sorted out I'll boot nb06 and then do a single change to add both to the inventory (updating 944783 to do that) | 17:31 |
clarkb | the /opt mount point for the ephemeral drive and the nodepool dib workspace volume collided but I think I can just remove the ephemeral drive mount point from fstab and call it good (seems this was done with nb01) | 17:44 |
clarkb | infra-root 104.130.253.28 is the new nb05 server. I think it lgtm so I'm going to repeat what I did to build an nb06 now | 17:59 |
clarkb | resize2fs made me e2fsck first but otherwise I didn't hit anything unexpected | 18:04 |
fungi | huh, normally that doesn't happen, but maybe the volume had gone too long without a fsck | 18:11 |
clarkb | it was all brand new. I was wondering if it needed fsck data to resize and we didn't have any yet | 18:12 |
fungi | oh, huh, why did you need to resize2fs then if it was new? | 18:13 |
fungi | you started with a single volume and already had data in it, then wanted to add a second volume? | 18:14 |
clarkb | fungi: because launch node creates the fs on the first volume but doesn't know how to handle two volumes in one vg/lv. So I let launch node do everything with one then add the second volume and extend things | 18:14 |
clarkb | both volumes are newly created | 18:14 |
clarkb | this is what I was asking about above, if I should do it all by hand. But it seems like this is working so I think it's ok? | 18:15 |
fungi | aha, another reason that volume-adding functionality in launch-node is probably not all that helpful (like where we carve them up into multiple logical volumes on mirror servers) | 18:15 |
clarkb | fwiw we ended up with more inodes on the nb05 filesystem when I was done than we had on nb01 so I think we're ok | 18:16 |
clarkb | not a ton more but almost 10% if my napkin math is correct | 18:16 |
fungi | yeah, i expect nb01 has seen multiple increases to its volume group. might have started with a smaller single volume | 18:16 |
fungi | then later migrated the extents to a larger volume, then later still added another volume | 18:17 |
fungi | anyway, if we don't see other builders using anywhere near that many inodes, then no i still wouldn't worry about "fixing" them for the sake of consistency or anything | 18:19 |
fungi | (i misunderstood earlier thinking you were asking if we should redo the existing volumes on the old builders) | 18:20 |
clarkb | ya I was trying to figure out if it is better to mkfs once rather than mkfs and resize2fs | 18:24 |
clarkb | which still isn't clear to me since we ended up with a better situation than the preexisting setup | 18:24 |
clarkb | nb06 is up at 104.130.127.175 with volumes sorted out. I'm working on dns and inventory changes now | 18:25 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Add nb05 and nb06 to DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/944794 | 18:29 |
opendevreview | Clark Boylan proposed opendev/system-config master: Deploy nb05 and nb06 Noble nodepool builders https://review.opendev.org/c/opendev/system-config/+/944783 | 18:33 |
clarkb | ok I think those two changes are ready to go. Feel free to jump onto those servers and double check what I did volume and lvm wise | 18:34 |
clarkb | now is a good time to remind people to edit the meeting agenda or let me know what updates you'd like to see | 18:36 |
clarkb | It is just about lunch time but I'll probably start on the meeting agenda afterwards | 18:36 |
fungi | huh, 944789 failed system-config-build-image-uwsgi-base-3.10-bookworm on a segfault during pip install uwsgi | 18:38 |
Clark[m] | Arg we've seen that before and had hacks like single thread uwsgi builds to work around it. I wonder though if we need the 3.10 image anymore | 18:41 |
Clark[m] | I think lodgeit is the one thing we use depending on it | 18:41 |
Clark[m] | So maybe we need to simplify image builds first? | 18:41 |
fungi | do you recall why we install lodgeit on 3.10? lack of 3.11 support or just haven't gotten around to upgrading it? | 18:42 |
Clark[m] | Oh I meant lodgeit is the only thing using the uwsgi image. Looks like it moved to 3.11 already | 18:43 |
opendevreview | Merged opendev/zone-opendev.org master: Add nb05 and nb06 to DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/944794 | 18:43 |
Clark[m] | So we can likely drop that image build. I wonder if we can do the same with python-builder and python-base yet | 18:44 |
fungi | yeah, the fewer jobs we need to run, the better all around | 18:45 |
Clark[m] | I can check after lunch | 18:46 |
Clark[m] | https://codesearch.opendev.org/?q=3.10-bookworm&i=nope&literal=nope&files=&excludeFiles=&repos= this quick check seems to show cleanup is safe. | 18:49 |
clarkb | fungi: you approved https://review.opendev.org/c/opendev/system-config/+/944783 but didn't review its parent. Can you do that too? | 19:50 |
opendevreview | Clark Boylan proposed opendev/system-config master: Rebuild our base python container images https://review.opendev.org/c/opendev/system-config/+/944789 | 19:53 |
opendevreview | Clark Boylan proposed opendev/system-config master: Drop python3.10 container image builds https://review.opendev.org/c/opendev/system-config/+/944799 | 19:53 |
clarkb | and that is the cleanup of 3.10 image builds to hopefully make things happier building for 3.11 and 2.12 | 19:53 |
clarkb | *3.12 | 19:53 |
opendevreview | Merged zuul/zuul-jobs master: Fix the upload-logs-s3 test playbook https://review.opendev.org/c/zuul/zuul-jobs/+/927600 | 19:59 |
opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Pull minio from quay instead of docker hub https://review.opendev.org/c/zuul/zuul-jobs/+/944801 | 20:02 |
fungi | oh, thanks, i wondered why it wasn't enqueuing, i reviewed the depends-on but missed it had a parent commit as well | 20:12 |
fungi | lgtm | 20:13 |
clarkb | I'm going to drop the parallel infra-prod meeting agenda item. I think it is working well and not sure I have anything to say other than that | 20:13 |
fungi | confirmed we install a /usr/local/bin/docker-compose as a wrapper around `docker compose` on noble | 20:14 |
clarkb | yup | 20:14 |
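A minimal sketch of what such a wrapper typically looks like; the actual script installed by system-config may differ:

```shell
#!/bin/sh
# forward legacy docker-compose invocations to the compose v2 plugin
exec docker compose "$@"
```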
fungi | yeah, seems the parallel stuff is working well, only reason i can think to keep it on the agenda is we cranked its mutex up to 4x | 20:14 |
clarkb | my first pass on meeting agenda edits is in | 20:19 |
clarkb | heh now uwsgi 3.11 is failing | 20:20 |
clarkb | I'm going to look into what we did in the past to hack around its problems | 20:20 |
clarkb | I'm quickly beginning to wonder if we should consider using something other than uwsgi | 20:25 |
clarkb | but one step at a time | 20:25 |
opendevreview | Merged opendev/system-config master: Make nodepool image export forward/backward compatible https://review.opendev.org/c/opendev/system-config/+/944788 | 20:26 |
opendevreview | Merged zuul/zuul-jobs master: Pull minio from quay instead of docker hub https://review.opendev.org/c/zuul/zuul-jobs/+/944801 | 20:28 |
opendevreview | Clark Boylan proposed opendev/system-config master: Rebuild our base python container images https://review.opendev.org/c/opendev/system-config/+/944789 | 20:31 |
fungi | wow, increasing pip verbosity *fixed* builds in the past? that's pure heisenbug | 20:32 |
clarkb | yes | 20:32 |
clarkb | we discovered this because we added it for more debugging, then it stopped failing. This was on bullseye not bookworm though so may be a noop here | 20:33 |
clarkb | but I figured starting with the old "uwsgi compiles are flaky" solution that adds more debugging info was a good start | 20:33 |
clarkb | but also it almost looks like gcc is segfaulting? | 20:33 |
clarkb | which is very interesting | 20:33 |
fungi | i'm taking that as a hint that i need a break before my 7pm conference call | 20:33 |
clarkb | I'm going to propose a change to lodgeit that attempts to use https://pypi.org/project/granian/ | 20:35 |
clarkb | I think if we switch to granian we don't need a dedicated image because they are good about publishing wheels for x86 and aarch64 | 20:35 |
fungi | looks promising | 20:37 |
opendevreview | Merged opendev/system-config master: Deploy nb05 and nb06 Noble nodepool builders https://review.opendev.org/c/opendev/system-config/+/944783 | 20:45 |
opendevreview | Clark Boylan proposed opendev/lodgeit master: Run lodgeit with granian instead of uwsgi https://review.opendev.org/c/opendev/lodgeit/+/944805 | 20:47 |
clarkb | I don't know if ^ will work but its a start | 20:47 |
clarkb | 944783 is deploying all the things as expected. Probably half an hour or so away from deploying the new nodepool servers? | 20:48 |
opendevreview | Clark Boylan proposed opendev/system-config master: Run lodgeit with granian https://review.opendev.org/c/opendev/system-config/+/944806 | 21:00 |
clarkb | I realized we don't have functional testing of lodgeit in lodgeit. ^ attempts to cover that gap | 21:01 |
frickler | https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/ :( | 21:09 |
opendevreview | Stephen Reaves proposed openstack/diskimage-builder master: Enable custom overlays https://review.opendev.org/c/openstack/diskimage-builder/+/943500 | 21:12 |
clarkb | nepenthes is really neat. But I'm thankful we haven't had to go that route yet | 21:14 |
clarkb | fungi: theory: what if uwsgi builds are failing because the python-builder image it is using is stale | 21:18 |
clarkb | hrm no we have dependencies and requires on the other images so we should have latest and greatest across the board | 21:20 |
clarkb | the nb05 and nb06 deployments completed and look ok at first glance. Once the hourlies finish for 21:00 I'll put nb01 and nb02 in the emergency file and shutdown their builders | 21:22 |
clarkb | then we can request builds for things and see what happens | 21:23 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Add upload-image-swift role https://review.opendev.org/c/zuul/zuul-jobs/+/944812 | 21:40 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Add upload-image-s3 role https://review.opendev.org/c/zuul/zuul-jobs/+/944813 | 21:40 |
clarkb | I've rechecked 944806 with an autohold in place. Running the image locally I wasn't able to find anything that looked like the problem to me | 21:50 |
clarkb | but I didn't have a db running to connect to and that is where it appears to be failing | 21:50 |
clarkb | now to shutdown the old builders | 21:51 |
clarkb | this is done | 21:52 |
clarkb | I'll request an ubuntu noble image build and a rocky image build I guess | 21:52 |
clarkb | corvus: question: does nodepool still get weird when the builder responsible for an image goes away? | 21:55 |
clarkb | corvus: I'm wondering because I just stopped the builders and they aren't running yet. I want to say we've corrected most of the issues with that and what should happen is nodepool will keep those images around as long as we need them but will let other builders delete those images if those builders are gone and we don't need the image anymore? | 21:56 |
clarkb | I guess we'll find out :/ | 21:57 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Add upload-image-s3 role https://review.opendev.org/c/zuul/zuul-jobs/+/944813 | 21:57 |
clarkb | I guess the worst case may be that we have to turn nb01 and nb02 back on temporarily to let them cleanup after themselves once we're happy with nb05 and nb06 | 21:57 |
corvus | clarkb: tbh not sure off the top of my head | 21:58 |
corvus | clarkb: i think the uploads should get deleted; i'm not sure about the zk record for the on-disk build | 22:00 |
clarkb | corvus: but the uploads should only delete after we've uploaded new replacement images from the new servers right? this won't be a case of accidental early cleanup | 22:01 |
clarkb | (fwiw I haven't seen images get deleted yet so I think that is the case) | 22:01 |
clarkb | and I can follow up with nodepool later once we have an image or two built by the new servers that cycle things | 22:01 |
corvus | oh yeah, i don't think there should be any accidental early cleanup | 22:01 |
corvus | i think there's a good chance the automatic cleanup should work too without needing to bring the old ones online | 22:02 |
clarkb | excellent I'll monitor and see where we end up | 22:02 |
corvus | (i think it will delete all the uploads, then delete the zk build records when all upload records are gone; and only then would the old builders delete their local files, but we don't care about the last step) | 22:02 |
clarkb | ack | 22:02 |
corvus | but again, that's only once it's decided the old uploads should be expired due to new uploads; i don't expect it to start that process before then | 22:03 |
clarkb | cool that seems to be in line with what I see so far (but we don't have any new images uploaded yet). Each new builder is building one image. They are both caching git repos now which isn't super fast the first time | 22:03 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Add upload-image-s3 role https://review.opendev.org/c/zuul/zuul-jobs/+/944813 | 22:12 |
clarkb | so the problem with lodgeit and granian is that granian expects to call the application(environ, start_response) wsgi interface but we instead seem to have a factory that builds the application, and I guess uwsgi is smart enough to call the factory instead? | 22:34 |
clarkb | anyway it's going to require some refactoring to get this to work I think | 22:34 |
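Roughly the distinction being described, with purely illustrative names rather than lodgeit's actual module layout: a factory returns a WSGI app but is not itself one, so a server that only speaks plain WSGI needs the built app exposed as a module-level callable.

```python
class PasteApp:
    def __call__(self, environ, start_response):
        # the plain WSGI interface every WSGI server ultimately calls
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"ok"]

def make_app(dburi):
    # application factory: builds and returns a WSGI app, but is not
    # itself callable as application(environ, start_response)
    return PasteApp()

# what a plain-WSGI server would be pointed at:
application = make_app("sqlite://")  # hypothetical config value
```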
opendevreview | Clark Boylan proposed opendev/lodgeit master: Run lodgeit with granian instead of uwsgi https://review.opendev.org/c/opendev/lodgeit/+/944805 | 22:39 |
opendevreview | Clark Boylan proposed opendev/lodgeit master: Run lodgeit with granian instead of uwsgi https://review.opendev.org/c/opendev/lodgeit/+/944805 | 22:41 |
opendevreview | Clark Boylan proposed opendev/system-config master: Run lodgeit with granian https://review.opendev.org/c/opendev/system-config/+/944806 | 22:41 |
clarkb | I think that may function now but I'm not sure that it is production quality yet. I think we need to memoize some stuff | 22:42 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Add upload-image-s3 role https://review.opendev.org/c/zuul/zuul-jobs/+/944813 | 22:49 |
clarkb | last call on meeting agenda items. I've got a meeting and then I'm sending it out around 00:00 UTC | 22:50 |
clarkb | the new ubuntu noble image built and is uploading now | 23:00 |
clarkb | rockylinux image built too so these builders are looking happy so far | 23:41 |
clarkb | the latest patches for lodgeit under granian do seem to work. I think the next step there is making it more efficient which I'll have to look into tomorrow | 23:55 |
clarkb | and now time to get the meeting agenda out | 23:55 |