*** cloudnull10 is now known as cloudnull1 | 04:47 | |
*** elodilles is now known as elodilles_pto | 06:14 | |
mnasiadka | morning | 06:25 |
---|---|---|
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/953460 | 06:27 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/953460 | 06:28 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/953460 | 06:29 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/953460 | 06:29 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/953460 | 06:30 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/953460 | 06:32 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/953460 | 06:34 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/953460 | 06:38 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/953460 | 06:38 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add centos-10-stream image definition https://review.opendev.org/c/opendev/zuul-providers/+/953726 | 06:41 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds https://review.opendev.org/c/opendev/zuul-providers/+/953460 | 06:42 |
mnasiadka | I think https://review.opendev.org/c/opendev/zuul-providers/+/953269 needs another push to be gated | 06:43 |
opendevreview | James E. Blair proposed openstack/project-config master: Reapply "Switch all Zuul tenants to use niz nodesets" https://review.opendev.org/c/openstack/project-config/+/953769 | 13:41 |
corvus | i think we're ready to try that again ^ | 13:50 |
opendevreview | Clark Boylan proposed opendev/system-config master: Replace zk06 with zk01 https://review.opendev.org/c/opendev/system-config/+/951164 | 15:02 |
clarkb | corvus: that change seems fine to me +2. I'd also like us to try and do ^ if we think today is a good day for it. I had to rebase to address a merge conflict with the change to zk12's inventory details though, which is why it just got a new patchset | 15:02 |
clarkb | corvus: fungi: I think the big things today are openstack and starlingx DCO stuff and then zuul-launcher? If we think that we're good to proceed with zk updates in the background while doing those I'd appreciate reviews/+2s | 15:03 |
clarkb | https://etherpad.opendev.org/p/opendev-zookeeper-upgrade-2025 is the process notes for that. I just checked and zk05 is still the leader so starting with zk06 should be good | 15:03 |
corvus | lgtm | 15:04 |
fungi | yeah, once i'm out of meetings i can take a closer look | 15:07 |
clarkb | thanks! | 15:07 |
clarkb | seems like it is fairly quiet and I think I should be able to work through the process today if others aren't worried about potential conflicts with other tasks | 15:08 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add a summitspeakers@lists.openinfra.org mailing list https://review.opendev.org/c/opendev/system-config/+/953783 | 15:10 |
fungi | oh that's a neat idea | 15:10 |
mnasiadka | clarkb: Any idea what to do with https://review.opendev.org/c/opendev/zuul-providers/+/953269? It seems it's randomly failing at least one/two/three jobs on source-repositories element (fetching that number of repos must be fun for gitea) | 15:36 |
clarkb | mnasiadka: the latest round of failures were actually due to failures to upload the resulting image to swift | 15:38 |
mnasiadka | uh, now I see that | 15:38 |
mnasiadka | Actually not | 15:38 |
mnasiadka | source-repositories fails but doesn't fail the job properly, and then the job goes to the post phase to publish an image that isn't there | 15:39 |
mnasiadka | see https://zuul.opendev.org/t/opendev/build/9a3728c7cdc242c9ba6158ada739445f | 15:39 |
clarkb | https://zuul.opendev.org/t/opendev/build/03c946441bee4ba79614e3deaa85e39b/log/job-output.txt#9761-9884 this looks successful to me | 15:39 |
clarkb | so anyway I think we have two failure modes going on here. One that is less in our control (but we can provide feedback to jamesdenton__ and dan_with about) and then the "we might be making too many requests to gitea all at once" problem which we have more control over | 15:40 |
jamesdenton__ | clarkb i think there was a maint in dfw prod, if you're having api issues there | 15:40 |
clarkb | jamesdenton__: this is sjc3 I think | 15:40 |
clarkb | jamesdenton__: https://zuul.opendev.org/t/opendev/build/04843a0faeee4888ac32d1acb5f361c9/log/job-output.txt#9342 this specifically | 15:41 |
clarkb | one option for the git fetches would be to create a job semaphore for those builds to reduce the total concurrency between themselves | 15:41 |
corvus | fetches are a small part of those builds | 15:42 |
clarkb | mnasiadka: ^ I think that may be where I would start. In the old system we only ever ran a maximum of 3 builds at the same time. Now we're running like 15. Maybe we set a semaphore of 5? | 15:42 |
clarkb | corvus: true, the semaphore is a very heavyweight solution. | 15:42 |
jamesdenton__ | thanks clarkb - looking into it | 15:43 |
corvus | how about changing the element to retry fetches? | 15:43 |
mnasiadka | that's another option I was thinking about | 15:43 |
corvus | what are the errors? | 15:43 |
corvus | here's one: 2025-06-30 12:40:16.384 | fatal: unable to access 'https://opendev.org/openstack/devstack-plugin-open-cas.git/': GnuTLS recv error (-110): The TLS connection was non-properly terminated. | 15:43 |
clarkb | that may be worth trying as well. One suspicion I have is that this could be related to the total number of connections more than the total load. So if we don't see retries make things better we may need a different approach but I would be open to trying that | 15:43 |
corvus | are they all like that? | 15:43 |
clarkb | corvus: yes I think so | 15:44 |
fungi | clarkb: mnasiadka: skimming that swift error briefly, why are we uploading a vhd there? | 15:44 |
fungi | i thought we only used those for rax classic due to xen | 15:45 |
fungi | oh, wait, this is temporary storage, ignore me! | 15:45 |
clarkb | fungi: yes, this is the intermediate step where we build the image then create a downloadable location for it in swift so that the followup publishing process can fetch and push to clouds as appropriate | 15:45 |
clarkb | yup exactly | 15:45 |
corvus | https://grafana.opendev.org/d/1f6dfd6769/opendev-load-balancer?orgId=1&from=now-6h&to=now&timezone=utc | 15:45 |
corvus | there is a ramp up in connections from 12:30-1300 | 15:46 |
fungi | but yeah, looks like that error could be some sort of middlebox terminating the upload prematurely, the task starts at 13:31:36 and gets aborted at 13:48:53 (~17.5 minutes into the upload) | 15:47 |
corvus | looking at all the failures in https://zuul.opendev.org/t/opendev/buildset/403de0442fea49d78d5e5a1ded814ad3 it looks like 2 of them are git failures, and 5 of them are swift upload failures? | 15:48 |
clarkb | corvus: mnasiadka I wonder if there is a way to have git keepalive a connection across git repos. Probably not but if there is something like ssh control persistence that may help here | 15:48 |
clarkb | corvus: yes I think the latest buildset is mostly swift failures. But previous buildsets have largely been gitea failures (but fewer of them overall) | 15:48 |
clarkb | corvus: mnasiadka it is also possible that we're not the actual source of the problem we're just more likely to hit issues in a noticeable way if there is external pressure | 15:51 |
clarkb | with the old system the build would fail then go back into the queue and eventually get rebuilt. Now we get a failure and don't retry the entire set until the next trigger? | 15:52 |
clarkb | in that case retries would probably be sufficient | 15:52 |
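For context, retrying the fetch inside the element could look roughly like the sketch below; the wrapper name, attempt count, and delay are illustrative assumptions, not the actual diskimage-builder change.

```bash
# Hedged sketch of a retry wrapper around the element's git calls; the
# function name, attempt count, and delay are assumptions for illustration.
git_retry() {
    local attempt=1 max_attempts=3 delay=30
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "failed after ${max_attempts} attempts: $*" >&2
            return 1
        fi
        echo "attempt ${attempt} failed, retrying in ${delay}s: $*" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# e.g. for the repo from the GnuTLS error above:
git_retry git clone https://opendev.org/openstack/devstack-plugin-open-cas.git
```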
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add image-build-jobs semaphore to limit concurrent jobs to 5 https://review.opendev.org/c/opendev/zuul-providers/+/953797 | 15:54 |
mnasiadka | Did this, but will have a look at the retries tomorrow | 15:54 |
mnasiadka | and check why source-repositories is not failing the job properly | 15:55 |
corvus | i mean, i left a -1 on that, but i feel like it's much more of a -2 | 15:55 |
corvus | i'm like -1.9 on that | 15:55 |
mnasiadka | sure, thought it's a fast option - but I guess it's better to fix the DIB element | 15:55 |
corvus | yeah, it's an easy option with a big negative impact on image build jobs | 15:56 |
clarkb | ok https://review.opendev.org/c/opendev/system-config/+/951164 has a +1 again if we want to proceed with replacing the first zookeeper server cc fungi | 15:56 |
corvus | https://zuul.opendev.org/t/opendev/buildsets?pipeline=periodic-image-build are generally doing pretty well. they take about 2 hours to run. with a 5-job semaphore we would double that to 4 hours. | 15:58 |
corvus | fixing the element should reduce the few failures we have there too | 15:58 |
fungi | clarkb: was the latest revision of 951164 just a rebase to resolve a merge conflict? | 16:00 |
clarkb | fungi: yes, ze12 was updated on the lines just above zk01 in the inventory file and that created a conflict | 16:01 |
clarkb | so rereview is more about "are we cool with trying to do this today" than about the actual code change itself I think | 16:01 |
fungi | are you ready for me to approve it now, or want to schedule that? | 16:01 |
clarkb | fungi: I think we either do it now or wait for wednesday (basically I want that change to go in early relative to my day start so that I can try and get through the whole cluster over a single work day) | 16:02 |
clarkb | so yes now is fine, or if you'd prefer we schedule it we can look at wednesday? | 16:02 |
fungi | just making sure you're already on hand and didn't want me to wait a few minutes | 16:03 |
clarkb | yup today works for me if we start now | 16:03 |
fungi | it's in the queue now | 16:03 |
clarkb | thanks! | 16:03 |
clarkb | once we get a little closer to that change merging I'll put zk06 in the emergency file and stop zookeeper there | 16:05 |
opendevreview | Merged openstack/project-config master: Reapply "Switch all Zuul tenants to use niz nodesets" https://review.opendev.org/c/openstack/project-config/+/953769 | 16:08 |
opendevreview | James E. Blair proposed openstack/project-config master: Add zuul-launcher max servers https://review.opendev.org/c/openstack/project-config/+/953803 | 16:12 |
corvus | i see some unexpected errors in the launcher: | 16:26 |
corvus | 2025-06-30 16:24:47,129 ERROR zuul.Launcher: [e: e8290ecbf5e947eba95425f2affb7d21] [req: 120137a5e4f544e6bcf89779b4c23177] Error in creating the server. Compute service reports fault: Build of instance f408fcde-0e98-4c99-95ef-c7a26bb93752 aborted: Image a6cca26c-0dae-46c0-91c8-35ce5b48cec6 is unacceptable: Image not in a supported format | 16:27 |
corvus | 2025-06-30 16:24:47,130 ERROR zuul.Launcher: [e: e8290ecbf5e947eba95425f2affb7d21] [req: 120137a5e4f544e6bcf89779b4c23177] Marking node <OpenstackProviderNode uuid=233c355a9ca14241b2841a668d6ce14a, label=niz-ubuntu-noble-8GB, state=building, provider=opendev.org%2Fopendev%2Fzuul-providers/raxflex-dfw3-main> as failed | 16:27 |
corvus | that's an image upload that finished.... right around that time | 16:27 |
corvus | 2025-06-30 16:24:00,950 DEBUG zuul.Launcher: Released upload lock for <ImageUpload 65c1f4978e5f41f9a7cb91592d61606f state: ready endpoint: raxflex/raxflex-DFW3 artifact: 3a52e8133fe542f4a35076804b6a3926 validated: True external_id: a6cca26c-0dae-46c0-91c8-35ce5b48cec6> | 16:28 |
corvus | i wonder if we're not waiting for a post-processing step | 16:28 |
clarkb | image not in a supported format would imply something along those lines. Do you know what format we're uploading to that provider? | 16:31 |
clarkb | if its qcow2 I wonder if the backend is converting to raw | 16:31 |
corvus | qcow2 | 16:33 |
clarkb | a lot of clouds convert to raw on the backend and when we've had problems with them doing that we've just uploaded raw images instead | 16:35 |
clarkb | I know with ceph this is advisable because you get better copy on write behavior from ceph (ironically) | 16:35 |
corvus | https://paste.opendev.org/show/b4UzBVq99xFIloG7gt8T/ | 16:35 |
corvus | that's what the 'image show' says | 16:36 |
clarkb | ya I think the backend conversions are hidden from us. Glance just repeats back what we upload? I think we can confirm something changes by comparing the checksum values though | 16:37 |
corvus | do you know what the "checksum" field means? | 16:37 |
clarkb | iirc the checksum in image show is glance calculated and if it changes the details on the backend that checksum won't match what we uploaded (this is a major flaw in glance that I've brought up before that I gave up on caring about when no one seemed to understand why it was problematic that I couldn't confirm the file I uploaded matched what glance thinks it has) | 16:38 |
clarkb | corvus: I believe it is an md5 checksum of the image | 16:38 |
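To make that comparison concrete, a hedged sketch follows; the local file path is a placeholder and the image ID is the one from the launcher error above.

```bash
# Compare a locally computed md5 against the checksum glance reports for the
# suspect image. Local path is a placeholder; image ID is from the error above.
md5sum /path/to/uploaded-image.qcow2
openstack image show -f value -c checksum a6cca26c-0dae-46c0-91c8-35ce5b48cec6
# If the values differ, the bits glance holds are not the bits that were uploaded.
```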
corvus | interesting! they are supposed to match in rax-flex, but they do not for this image | 16:40 |
corvus | iow, all the other working images have matching checksums, but this one does not | 16:40 |
clarkb | I wonder if this is just a "normal" bitflip type of situation then | 16:40 |
clarkb | or maybe a short write | 16:40 |
clarkb | and the "not in a supported format" error message is how nova / glance's internal checking surfaces that failure | 16:41 |
corvus | we didn't log any errors during the upload | 16:42 |
corvus | unfortunately, i have left https://review.opendev.org/931824 unaddressed, so i don't think we can localize the error further. i will implement that as pennance. | 16:43 |
corvus | meanwhile, i will delete that upload and let it try again. we may get more data that way | 16:43 |
corvus | actually... 1 sec. we can see if we have the same error in sjc3 | 16:44 |
corvus | because they should have come from the same file | 16:44 |
clarkb | can also cross check the md5 checksums | 16:46 |
clarkb | if they don't match then it would be region specific upload issue maybe | 16:46 |
corvus | ++ | 16:46 |
clarkb | I have put zk06.opendev.org in the emergency file. We're a few minutes away from merging the change to replace it. I'll shut down zookeeper there when we're a bit closer | 16:47 |
fungi | thanks! | 16:49 |
corvus | only the sjc3 upload is affected. unfortunately, two different launchers uploaded those. that isn't normally supposed to happen (we're supposed to try to download the image only once), but it can happen as a fallback. | 16:50 |
corvus | grr i mean, only the dfw3 upload is affected. sjc3 is fine. | 16:50 |
corvus | oh i think i see it | 16:54 |
corvus | remember those api errors mentioned earlier? | 16:54 |
clarkb | ya they affected uploads into the cloud regions or maybe we got short writes? | 16:55 |
corvus | yeah. still digging a bit, but we definitely hit one during the upload. we retried the upload, but i'm guessing something persisted, since i think we used the same image name. | 16:55 |
corvus | (that's why we switched launchers -- the second upload happened from the second launcher) | 16:56 |
*** iurygregory__ is now known as iurygregory | 16:58 | |
corvus | i think we need a "delete and replace upload" function, and to call it in cases like this. we probably shouldn't try to repeat a failed upload under the same name without deleting. | 16:58 |
clarkb | makes sense particularly for multi part uploads | 16:59 |
clarkb | it might consider some parts of the upload to be successful even when it shouldn't | 16:59 |
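Roughly the manual equivalent of that "delete and replace upload" idea, sketched with the OpenStack CLI; the local file path and image name are placeholders, not what the launcher actually does internally.

```bash
# Delete the stale/corrupt upload first, then re-upload under the same name,
# rather than retrying on top of whatever glance kept from the aborted attempt.
openstack image delete a6cca26c-0dae-46c0-91c8-35ce5b48cec6
openstack image create --disk-format qcow2 --container-format bare \
    --file /path/to/local-image.qcow2 replacement-image-name
```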
clarkb | ok testinfra tests are running now. The change should merge shortly. I'm going to stop zookeeper on zk06 now | 17:02 |
clarkb | connections appear to have rebalanced according to stat to both 04 and 05 as expected and 05 remains the leader | 17:03 |
opendevreview | Merged opendev/system-config master: Replace zk06 with zk01 https://review.opendev.org/c/opendev/system-config/+/951164 | 17:07 |
clarkb | ok the zookeeper deployment job is finally running. This should bring up zk01 in a half working state. I need to restart zk04 and zk05 to pick up the new server listings (in that order so the leader goes last) after the deployment is done | 17:37 |
clarkb | nodepool zk configs also updated at least on nl05 but services didn't restart there. Once I've got the cluster running in a three node state happily again I plan to restart nodepool services and maybe a zuul merger or two just to sanity check the connectivity to the new shape of the cluster | 17:38 |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Replace StarlingX's CLA enforcement with the DCO https://review.opendev.org/c/openstack/project-config/+/953819 | 17:39 |
fungi | ildikov: i'll plan on merging that ^ tomorrow | 17:39 |
clarkb | `echo mntr | nc localhost 2181 | grep followers` shows 2 synced followers with one of them being a non voting follower | 17:40 |
clarkb | which is what I think we expect since zk01 hasn't been inducted into the cluster yet (need to restart the leader for that) | 17:40 |
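For reference, the four-letter-word checks being used through this rotation look roughly like this when run on a cluster member (follower counts only show up in mntr output on the leader):

```bash
# Is this node the leader or a follower?
echo stat | nc localhost 2181 | grep Mode
# On the leader: how many synced followers, and are any still non-voting?
echo mntr | nc localhost 2181 | grep -E 'zk_synced_followers|zk_synced_non_voting_followers'
```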
clarkb | ok cool zookeeper job completed successfully. I'm going to double check all three nodes have the right server list and that zk05 is still the leader then I'll restart zk04, check the state of things again then restart zk05 | 17:41 |
clarkb | yup this looks good I'm going to restart zk04 now | 17:42 |
clarkb | I think that went well. zk05 is still the leader and still reports the same follower counts | 17:44 |
clarkb | ok I'm proceeding with a restart of the zk05 leader next | 17:44 |
clarkb | zk04 is the new leader | 17:45 |
clarkb | there are two synced followers and no nonvoting followers. I think everything has caught up now | 17:46 |
clarkb | I need to edit the etherpad because it assumed zk05 would stay the leader but zk04 is leader so things change. Then I need to push up a change to remove zk05 and replace it with zk02 to get that testing then I can restart nodepool and a zuul merger or three | 17:47 |
opendevreview | Clark Boylan proposed opendev/system-config master: Replace zk05 with zk02 https://review.opendev.org/c/opendev/system-config/+/953821 | 17:51 |
clarkb | infra-root ^ quick review on that is appreciated | 17:52 |
clarkb | I'm going to restart nodepool services now | 17:52 |
clarkb | most new connections are going to zk05 but at least one made it to zk01 | 17:58 |
clarkb | corvus: do you think it makes sense to restart zuul mergers too to try and get connections onto zk01 or just trust that since the connection count there is 2 that we're probably good enough? | 18:00 |
corvus | it really does, by pure coincidence, look like about 50% of the openstack node traffic is going to niz, the other is still nodepool. 53% to be more exact. that's based on the labels we just happened to have switched so far. | 18:01 |
clarkb | nice | 18:01 |
corvus | clarkb: maybe try a merger or two? | 18:01 |
clarkb | on it | 18:01 |
corvus | just as a canary in case there's something we're not seeing | 18:02 |
corvus | i'm going to start merging the label-switch changes to drive more niz traffic. | 18:02 |
opendevreview | Merged opendev/zuul-providers master: Switch centos-9-stream from nodepool to niz https://review.opendev.org/c/opendev/zuul-providers/+/952715 | 18:03 |
clarkb | I did zm01 and zm02. zm01 moved from zk04 to zk05 and zm02 moved from zk04 to zk04 (it reconnected to the same one) | 18:04 |
clarkb | I'll keep doing more to see if I can get at least one to go to zk01 | 18:04 |
clarkb | zm03 connected to zk01 | 18:05 |
clarkb | zm04 connected to zk05. I think that's probably good though. We have nb05 and zm03 talking to zk01 | 18:07 |
*** iurygregory_ is now known as iurygregory | 18:07 | |
corvus | \o/ | 18:08 |
opendevreview | James E. Blair proposed openstack/project-config master: Grafana: update zuul status page for NIZ https://review.opendev.org/c/openstack/project-config/+/953823 | 18:09 |
clarkb | I've approved https://review.opendev.org/c/opendev/system-config/+/953821 to keep things moving forward | 18:12 |
clarkb | corvus: I don't see zk01 in the grafana graphs for zuul. Do you know if I have to do anything special to get that data? | 18:13 |
clarkb | oh I see we filter on those node names in the graph definition | 18:13 |
clarkb | I'll just do a followup to update the graphs and not worry about it at the moment | 18:13 |
clarkb | unless you think that is important | 18:13 |
corvus | followup sounds fine | 18:14 |
corvus | we're collecting the data | 18:14 |
corvus | (it's in graphite) | 18:14 |
clarkb | cool | 18:14 |
corvus | https://opendev.org/airship/promenade/src/branch/master/zuul.d/nodesets.yaml | 18:19 |
corvus | i wonder what the story is there | 18:19 |
clarkb | might be like devstack where they are relying on the groups values to do special stuff in the jobs? | 18:20 |
corvus | maybe. surprised to see a group with the same name as a node though | 18:20 |
opendevreview | Merged opendev/zuul-providers master: Switch debian-bookworm to niz https://review.opendev.org/c/opendev/zuul-providers/+/952716 | 18:21 |
opendevreview | Merged opendev/zuul-providers master: Switch debian-bullseye-arm64 nodesets to niz https://review.opendev.org/c/opendev/zuul-providers/+/952717 | 18:21 |
clarkb | corvus: after I've got zk02 enrolled in the cluster and I've checked things out the next step is to do the out of band update for zuul's zookeeper connection to tell it to connect to all three new nodes (while only two are up and running) then do a zuul restart. I may have questions / ask for help at that point depending on how things are going. Also I wonder if we should restart | 18:21 |
clarkb | executors non-gracefully but then do the schedulers one by one to avoid a web ui and event handling outage? | 18:21 |
clarkb | this way we only have to do the big zuul restart once | 18:22 |
corvus | that plan sgtm | 18:22 |
corvus | launchers one at a time too | 18:22 |
clarkb | ack | 18:22 |
corvus | clarkb: do a hard stop of the executors all at once, that way will minimize jobs rolling over to executors that themselves are about to stop. | 18:24 |
clarkb | corvus: best way to do that is with ansible-playbook running zuul_stop.yaml with a limit to the executors? | 18:24 |
corvus | that should work. i have developed a habit of just doing an ad-hoc ansible command that does 'docker-compose down'. :) | 18:25 |
clarkb | happy to reuse what you know works too :) | 18:26 |
clarkb | something like `ansible -f 20 zuul-executor -m shell -a 'cd /etc/zuul-executor && docker-compose down'` then run a similar docker-compose up -d after | 18:27 |
clarkb | that may be best actually because then I can do each class of zuul service one by one more easily and give them the different behavior they want | 18:28 |
corvus | yep exactly | 18:28 |
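Spelled out, the ad-hoc sequence discussed above would look roughly like the sketch below; the group name and compose directory follow the convention in the command quoted earlier in the log.

```bash
# Hard-stop every executor at once, then bring them all back up.
ansible -f 20 zuul-executor -m shell -a 'cd /etc/zuul-executor && docker-compose down'
ansible -f 20 zuul-executor -m shell -a 'cd /etc/zuul-executor && docker-compose up -d'
```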
corvus | that's mostly because i'm too lazy to go look at what the playbook does. not that i necessarily think it's better. :) | 18:29 |
opendevreview | Merged opendev/system-config master: Replace zk05 with zk02 https://review.opendev.org/c/opendev/system-config/+/953821 | 18:29 |
clarkb | ok I'm going to stop zookeeper on zk05 now that ^ has merged | 18:29 |
clarkb | zk04 remains the leader | 18:30 |
opendevreview | Merged opendev/zuul-providers master: Switch ubuntu-jammy-arm64 labels to niz https://review.opendev.org/c/opendev/zuul-providers/+/952718 | 18:39 |
opendevreview | Merged opendev/zuul-providers master: Switch to using niz for ubuntu-noble-arm64 https://review.opendev.org/c/opendev/zuul-providers/+/952719 | 18:39 |
opendevreview | Merged opendev/zuul-providers master: Switch to using niz labels for rockylinux-8 https://review.opendev.org/c/opendev/zuul-providers/+/952720 | 18:39 |
opendevreview | Merged opendev/zuul-providers master: Switch to using niz for rockylinux-9 https://review.opendev.org/c/opendev/zuul-providers/+/952721 | 18:39 |
opendevreview | Clark Boylan proposed opendev/system-config master: Replace zk04 with zk03 https://review.opendev.org/c/opendev/system-config/+/953824 | 18:45 |
clarkb | I realized I could get things rolling with testing of ^ since I know which server is going to be replaced last now | 18:45 |
opendevreview | Clark Boylan proposed openstack/project-config master: Limit grafana graphs for zookeeper to zk01-03 https://review.opendev.org/c/openstack/project-config/+/953825 | 18:58 |
clarkb | zookeeper job has finished. I'm going to do my checks across zk01, zk02, and zk04 now. Then will restart zk01 and zk04 once I'm happy | 19:01 |
clarkb | yup everything looked as I expect it to so I've already restarted zk01 since it was a follower. Going to restart zk04 now which is the current leader | 19:04 |
clarkb | zk02 is the new leader and it has 2 synced followers and no nonvoting followers | 19:05 |
clarkb | so our cluster is now zk01, zk02, and zk04 with zk04 as leader. Most services are only aware of zk04 as a valid connection point. I'm going to go restart nodepool services and zuul mergers like I did last time now which should have them connect to zk01 and zk02 and zk04 as valid options | 19:06 |
clarkb | then we can look into the bigger zuul restart after updating its config | 19:07 |
clarkb | oh sorry zk02 is the leader not zk04 | 19:07 |
opendevreview | Clark Boylan proposed opendev/system-config master: Replace zk04 with zk03 https://review.opendev.org/c/opendev/system-config/+/953824 | 19:10 |
clarkb | there was a test failure in ^ due to my updating the zk04 node in tests to zk99. Should be fixed now | 19:10 |
clarkb | nodepool services have restarted; I believe nl05 is connected to zk02 | 19:16 |
clarkb | I'm going to run my zuul config update playbook against the zuul mergers and do them in bulk | 19:17 |
clarkb | ok I think that lgtm (zuul-mergers are done and are running with config for all three new servers now) | 19:22 |
clarkb | corvus: I think I want to do executors last so that test results for 953824 are not lost and have to restart | 19:23 |
corvus | ack | 19:23 |
clarkb | corvus: I'll do the launchers next. Can I hard stop/start them like mergers? looks like yes? | 19:24 |
clarkb | how much of a delay between each? | 19:24 |
corvus | yep... it only takes them like 10 seconds to come up | 19:25 |
clarkb | got it. Working on that next | 19:25 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: WIP: source-repositories: make git retry https://review.opendev.org/c/openstack/diskimage-builder/+/953829 | 19:27 |
clarkb | zl01 is done. I'll pause for a moment then do zl02 | 19:28 |
clarkb | and now zl02 is done | 19:29 |
clarkb | arg I didn't properly fix the zookeeper tests so I need new test results anyway | 19:29 |
clarkb | considering that I think I'll proceed with executors next now that launchers are done | 19:29 |
clarkb | I'm going to docker compose down all of them then docker compose up -d all of them | 19:30 |
clarkb | that is done | 19:33 |
corvus | i see nothing exploding on the launchers | 19:34 |
clarkb | corvus: you good with me proceeding to do scheduler and web on zuul01 then zuul02 once zuul01 is up again now? | 19:34 |
clarkb | I've updated configs on zuul01 and zuul02. I'll proceed with restarting scheduler and web services on zuul01 now | 19:37 |
corvus | yep | 19:38 |
clarkb | ok zuul01 has had its services restarted | 19:39 |
clarkb | zuul01 web says it is stopped in the components list | 19:40 |
clarkb | now it says it is initializing. I just had to wait a bit I guess | 19:40 |
opendevreview | Clark Boylan proposed opendev/system-config master: Replace zk04 with zk03 https://review.opendev.org/c/opendev/system-config/+/953824 | 19:41 |
clarkb | hopefully that fixes testing | 19:41 |
clarkb | corvus: /components says zuul01 scheduler is running now. That's super quick these days. | 19:44 |
corvus | :) | 19:44 |
clarkb | and now zuul web is running. I'm going to do zuul02 nowish | 19:44 |
clarkb | the hourly run for zuul will reset the zuul.conf to contain zk01,zk02, and zk04 which is fine since zuul doesn't look at that config except at startup so these restarts should have them all looking at zk01,zk02, and zk03. Then by the time the next zuul restart happens we should have all the zookeepers updated in our inventory | 19:46 |
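A quick way to confirm what a given host would pick up on its next restart (sketch: the [zookeeper] section and its hosts option are standard Zuul configuration; the path assumes the usual /etc/zuul layout):

```bash
# Show the ZooKeeper connection string the next zuul restart would use.
sudo grep -A 10 '^\[zookeeper\]' /etc/zuul/zuul.conf | grep '^hosts'
```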
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on timeout https://review.opendev.org/c/openstack/diskimage-builder/+/721581 | 19:47 |
opendevreview | Michal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on timeout https://review.opendev.org/c/openstack/diskimage-builder/+/721581 | 19:48 |
clarkb | #status log Restarted all of zuul to pick up new zookeeper server configuration | 19:52 |
opendevstatus | clarkb: finished logging | 19:52 |
clarkb | we are now at the point of merging https://review.opendev.org/c/opendev/system-config/+/953824 to remove zk04 and add zk03. | 19:52 |
clarkb | I think it should pass testing now too :) | 19:52 |
opendevreview | Merged openstack/project-config master: Make scripts DCO compliant https://review.opendev.org/c/openstack/project-config/+/950770 | 19:55 |
corvus | i'm not sure i've seen this keystone error before: https://paste.opendev.org/show/bZd9L8ZsYwelQ1xSz6eY/ | 20:00 |
corvus | i'm not worried about it atm; it looks like it was a one-off. | 20:00 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Add new-style node labels https://review.opendev.org/c/opendev/zuul-providers/+/953832 | 20:01 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Switch all nodepool labels to niz https://review.opendev.org/c/opendev/zuul-providers/+/953833 | 20:01 |
corvus | i'm ready to do that ^ to drive up the niz traffic more. | 20:01 |
clarkb | re the error I've not seen that one before | 20:04 |
clarkb | corvus: fungi https://review.opendev.org/c/opendev/system-config/+/953824 passed its zookeeper test job this time. Can I get a review and/or approval if it looks good? | 20:06 |
corvus | lgtm, despite the trailing whitespace :) | 20:07 |
frickler | hmm, I just noticed that I lost all my bash history of zuul-client commands on zuul02. also "sudo zuul-client autohold-list" first downloaded the container and then fails because the config seems broken. is this expected and I need to switch to something new? | 20:08 |
clarkb | frickler: the server was replaced over the weekend | 20:09 |
clarkb | I suspect that explains both behaviors. Do we need to issue a new auth jwt token on those servers? | 20:09 |
corvus | zuul-client autohold-list --tenant openstack works for me? | 20:10 |
frickler | oh, wait, I was missing the --tenant argument. adding that the command works fine indeed. so I just need to re-discover the proper commands I guess | 20:11 |
corvus | yes, if that argument is omitted, it outputs this error: zuulclient.cmd.ArgumentException: Error: the --tenant argument or the 'tenant' field in the configuration file is required | 20:11 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Switch all nodepool labels to niz https://review.opendev.org/c/opendev/zuul-providers/+/953833 | 20:19 |
corvus | okay that's ready, now with fewer config errors :) | 20:20 |
clarkb | corvus: and once those go in we shutdown nodepool launchers? | 20:22 |
clarkb | put another way what is the coordination point for that change | 20:22 |
clarkb | maybe zuul simply prefers niz over nodepool? | 20:22 |
corvus | yeah, zuul prefers niz, no shutdown necessary | 20:25 |
corvus | we may still have a few niz requests (cf those last couple of images -- need to check where we are on that) | 20:26 |
corvus | so next step would be to check the delta again, close the gap by adding any new labels necessary, then we can shutdown nodepool (if desired, but not required). | 20:27 |
corvus | if anything goes wrong, we can just revert that change and it's instantly back to nodepool. | 20:27 |
opendevreview | Merged opendev/zuul-providers master: Add new-style node labels https://review.opendev.org/c/opendev/zuul-providers/+/953832 | 20:28 |
opendevreview | Merged opendev/zuul-providers master: Switch all nodepool labels to niz https://review.opendev.org/c/opendev/zuul-providers/+/953833 | 20:28 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add Ubuntu bionic/focal builds, labels and provider config https://review.opendev.org/c/opendev/zuul-providers/+/953269 | 20:28 |
corvus | yeah those :) | 20:28 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add Ubuntu bionic/focal builds, labels and provider config https://review.opendev.org/c/opendev/zuul-providers/+/953269 | 20:31 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Drop niz- label prefix from nodesets https://review.opendev.org/c/opendev/zuul-providers/+/953835 | 20:33 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add Ubuntu bionic/focal builds, labels and provider config https://review.opendev.org/c/opendev/zuul-providers/+/953269 | 20:33 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add Ubuntu bionic/focal builds, labels and provider config https://review.opendev.org/c/opendev/zuul-providers/+/953269 | 20:34 |
opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: Add Ubuntu bionic/focal builds, labels and provider config https://review.opendev.org/c/opendev/zuul-providers/+/953269 | 20:36 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Remove "normal" labels, etc https://review.opendev.org/c/opendev/zuul-providers/+/953836 | 20:44 |
clarkb | I've done a first pass update at our meeting agenda. I dropped the centos 10 dib stuff as I think that is largely done now. I added some notes about zuul launcher status and zookeeper and zuul server replacements | 21:00 |
clarkb | anything else to add? | 21:00 |
corvus | that probably covers all the changes i'd ask to make | 21:00 |
fungi | nothing here | 21:03 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Increase launch-timeout for rax-ord https://review.opendev.org/c/opendev/zuul-providers/+/953841 | 21:30 |
corvus | i notice that rax-iad is turned off; what's the status with that? | 21:31 |
clarkb | corvus: I think that may have gotten lost when I ended up afk for a while | 21:32 |
clarkb | corvus: I suspect we can try turning it back on now and see if we get better behavior | 21:32 |
corvus | ack. maybe leave it be for a bit, until we run out of other node-related things to change. :) | 21:34 |
opendevreview | Merged opendev/system-config master: Replace zk04 with zk03 https://review.opendev.org/c/opendev/system-config/+/953824 | 21:36 |
clarkb | I will shutdown zookeeper on zk04 to get ahead of ^ momentarily | 21:37 |
clarkb | that's done. zk02 remains the leader and has one follower currently | 21:38 |
clarkb | after zk03 deploys I'll double check zk01 and zk02 still look good then restart zk01 then zk02 (since 02 is leader it goes last). Then I can restart all of the nodepool services one last time. Then we should be fully on the new cluster and can start looking at followup changes | 21:40 |
clarkb | zookeeper job has completed successfully. I'll check things and then do the necessary restarts | 22:08 |
clarkb | restarting zk01 made zk03 the leader | 22:10 |
clarkb | but the zxid seems appropriate and zuul is returning expected data so I think zookeeper handled that properly | 22:10 |
clarkb | I'm just surprised for it to become the leader before I restarted zk02 | 22:11 |
clarkb | zk02 went into a state where it stopped serving requests after that zk01 restart resulted in zk03 being leader. I believe that is because zk02 was running with a config that didn't allow for zk03. Restarting zk02 to pick up the config change has it working again as a follower | 22:12 |
clarkb | corvus: ^ not sure if there is any other checking you want to do. I think this went well. I'm going to work on restarting nodepool services so they are aware of the new (and final) cluster state | 22:14 |
corvus | clarkb: probably a good idea to do the graphs change now; would be good to double check those | 22:15 |
clarkb | corvus: https://review.opendev.org/c/openstack/project-config/+/953825 it is ready if you are | 22:16 |
corvus | clarkb: i think there's one more; left a comment | 22:17 |
clarkb | fwiw zk02 reported 2 synced followers one of which was nonvoting prior to the zk01 restart. I think when I restarted zk01 it decided that zk03 was the best leader because zk02 and zk03 were synced and zk03 had the higher id value | 22:17 |
clarkb | so I think I've convinced myself that this behavior is expected and not abnormal | 22:17 |
clarkb | bah how did I miss that one | 22:17 |
clarkb | then because zk02 didn't have config that knows about zk03 it basically went idle and stopped doing work | 22:18 |
clarkb | it was a coup amongst the zks | 22:18 |
opendevreview | Clark Boylan proposed openstack/project-config master: Limit grafana graphs for zookeeper to zk01-03 https://review.opendev.org/c/openstack/project-config/+/953825 | 22:18 |
clarkb | corvus: ^ there | 22:19 |
clarkb | nl07 is connected to zk02 (since zk03 became leader lots of stuff ended up over there and it is fine). I just need to finish restarting nl05 and nl06 then I think the application side of this move is done | 22:23 |
clarkb | nl05 also connected to zk02 | 22:24 |
clarkb | all services have been restarted on configs with the three new nodes. All three new nodes are in place and seem to indicate that they are talking to each other via stat and mntr 4 letter commands. Services are stopped on the old servers and they are removed from the inventory | 22:25 |
clarkb | I think confirming that grafana looks sane is the next step then we can do things like cleanup dns. I'll work on a change for that next | 22:25 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Remove zk04-zk06 from DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/953844 | 22:29 |
opendevreview | Clark Boylan proposed opendev/system-config master: Cleanup zookeeper config management https://review.opendev.org/c/opendev/system-config/+/953846 | 22:32 |
opendevreview | Clark Boylan proposed opendev/system-config master: Remove docker compose version from zuul services https://review.opendev.org/c/opendev/system-config/+/953848 | 22:35 |
clarkb | ok last call on the meeting agenda. I'm going to send that out in ~10 minutes | 22:36 |
clarkb | corvus: until we have live data we have this screenshot: https://58e37da755b04d7dc54b-c9565b17ee8809c34acc7a1ea3e19ee3.ssl.cf1.rackcdn.com/openstack/e7f70913b8b84f5394eaed9aa16995c6/screenshots/zuul-status.png | 22:39 |
corvus | zk01 latency looks a little bit higher than ould old values (0.4ms) but probably within the margin of error | 22:41 |
corvus | i don't see anything else unusual | 22:41 |
corvus | s/ould/our/ | 22:41 |
opendevreview | Merged opendev/zuul-providers master: Increase launch-timeout for rax-ord https://review.opendev.org/c/opendev/zuul-providers/+/953841 | 22:43 |
corvus | there's a handful of low-priority niz changes at https://review.opendev.org/q/hashtag:opendev-niz+status:open | 22:44 |
clarkb | ack I'll review those as soon as I hit send on this meeting agenda | 22:45 |
corvus | looks like we're at 87% niz and 13% nodepool | 22:48 |
corvus | clarkb: since i had a shell open, i ran the cacti graph create script for zk01-zk03 | 22:52 |
clarkb | thanks! | 22:52 |
opendevreview | Merged openstack/project-config master: Limit grafana graphs for zookeeper to zk01-03 https://review.opendev.org/c/openstack/project-config/+/953825 | 22:54 |
clarkb | corvus: all those niz related stats changes and nodeset/label updates lgtm | 22:54 |
corvus | \o/ | 22:54 |
clarkb | https://grafana.opendev.org/d/21a6e53ea4/zuul-status?orgId=1&from=now-15m&to=now&timezone=utc seems to have updated now | 22:57 |
clarkb | oh I'm only looking at the last 15 minutes | 22:57 |
clarkb | you may wish to change the time range if you open that link | 22:57 |
corvus | lgtm | 22:58 |
clarkb | I'll probably leave the old servers up with their DNS records in place until at least tomorrow. I think this is a good pause point. Things seem to be working and we can see how zuul handles the periodic job enqueues later this evening before doing more destructive changes to the old servers | 22:59 |
corvus | clarkb: mnasiadka why is https://review.opendev.org/953269 marked WIP? | 22:59 |
clarkb | I don't know why | 22:59 |
corvus | i don't see an explicit event for it | 23:01 |
corvus | hypothesis: it was set by one of the rebase or web-based editing actions, possibly accidentally or without mnasiadka 's knowledge | 23:02 |
corvus | oh, well it has a depends on anyway | 23:03 |
clarkb | corvus: https://zuul.opendev.org/t/openstack/build/9707f99650374c28854400788ebfc0b1/log/job-output.txt#31-53 | 23:09 |
clarkb | corvus: I suspect that build got a nodeset built from nodes from two different providers | 23:09 |
clarkb | corvus: note the Region values in particular (only one provider value is set) | 23:10 |
clarkb | I noticed because the zookeeper test actually checks that external connectivity to the zookeeper non ssl port is blocked and I think it got confused running that test from bridge to the zk test server? | 23:10 |
clarkb | and that resulted in a socket timeout error rather than socket no route to host error | 23:11 |
clarkb | I think this may impact multinode testing in general since we try to set up vxlan tunnels for that over private ips iirc. But I can't be sure of that without more digging | 23:11 |
corvus | it does look like that was from two different providers. the launcher should prefer to use a single provider | 23:12 |
corvus | but it is capable of not doing so in extenuating circumstances | 23:12 |
corvus | (this is a change from nodepool) | 23:12 |
corvus | i'll try to see what the circumstances were | 23:13 |
clarkb | right nodepool would only ever provide nodes from the same provider or fail | 23:13 |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Replace Airship's CLA enforcement with the DCO https://review.opendev.org/c/openstack/project-config/+/953849 | 23:13 |
clarkb | I see openstack grenade multinode jobs are succeeding in the gate so this isn't a hard fail for everything. That would support the extenuating circumstances theory | 23:14 |
corvus | clarkb: the node was a leftover ready node from a previous canceled request (probably a previous patchset of the same change). it looks like we could be better about the interaction of ready nodes and provider preference. | 23:28 |
clarkb | corvus: got it. The other thing I wonder about is why the provider info is lost there | 23:36 |
clarkb | maybe we don't carry that forward when reusing recycled nodes? | 23:36 |
corvus | yeah, an unattached ready node loses its provider info, mostly so that it can float to other providers/tenants/etc that have the same config. we should re-assign it when we select it. | 23:37 |
corvus | (it's a weird artifact of having multiple providers with the same endpoint, but they could be different in different tenants) | 23:38 |
corvus | clarkb: fyi https://review.opendev.org/953851 is the simplest way i can think to improve that. there's a better way, but it'd make a very complicated method even more complicated, so i'd like to try simple first. :) | 23:53 |
corvus | clarkb: i do agree that could end up causing problems for devstack runs, but hopefully not too many before we get that fixed. i think it's likely to be rare enough we can roll forward instead of back. | 23:56 |