Monday, 2025-06-30

*** cloudnull10 is now known as cloudnull104:47
*** elodilles is now known as elodilles_pto06:14
mnasiadkamorning06:25
opendevreviewMichal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds  https://review.opendev.org/c/opendev/zuul-providers/+/95346006:27
opendevreviewMichal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds  https://review.opendev.org/c/opendev/zuul-providers/+/95346006:28
opendevreviewMichal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds  https://review.opendev.org/c/opendev/zuul-providers/+/95346006:29
opendevreviewMichal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds  https://review.opendev.org/c/opendev/zuul-providers/+/95346006:29
opendevreviewMichal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds  https://review.opendev.org/c/opendev/zuul-providers/+/95346006:30
opendevreviewMichal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds  https://review.opendev.org/c/opendev/zuul-providers/+/95346006:32
opendevreviewMichal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds  https://review.opendev.org/c/opendev/zuul-providers/+/95346006:34
opendevreviewMichal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds  https://review.opendev.org/c/opendev/zuul-providers/+/95346006:38
opendevreviewMichal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds  https://review.opendev.org/c/opendev/zuul-providers/+/95346006:38
opendevreviewMichal Nasiadka proposed opendev/zuul-providers master: Add centos-10-stream image definition  https://review.opendev.org/c/opendev/zuul-providers/+/95372606:41
opendevreviewMichal Nasiadka proposed opendev/zuul-providers master: Add CentOS Stream 10 builds  https://review.opendev.org/c/opendev/zuul-providers/+/95346006:42
mnasiadkaI think https://review.opendev.org/c/opendev/zuul-providers/+/953269 needs another push to be gated06:43
opendevreviewJames E. Blair proposed openstack/project-config master: Reapply "Switch all Zuul tenants to use niz nodesets"  https://review.opendev.org/c/openstack/project-config/+/95376913:41
corvusi think we're ready to try that again ^13:50
opendevreviewClark Boylan proposed opendev/system-config master: Replace zk06 with zk01  https://review.opendev.org/c/opendev/system-config/+/95116415:02
clarkbcorvus: that change seems fine to me +2. I'd also like us to try and do ^ if we think today is a good day for it. I had to rebase to address a merge conflict with the change to ze12's inventory details though which is why it just got a new patchset15:02
clarkbcorvus: fungi: I think the big things today are openstack and starlingx DCO stuff and then zuul-launcher? If we think that we're good to proceed with zk updates in the background while doing those I'd appreciate reviews/+2s15:03
clarkbhttps://etherpad.opendev.org/p/opendev-zookeeper-upgrade-2025 is the process notes for that. I just checked and zk05 is still the leader so starting with zk06 should be good15:03
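
For reference, the leader check mentioned above can be done with ZooKeeper's four-letter commands; a minimal sketch, assuming the plaintext client port 2181 is reachable locally on each zk host and that stat is enabled in 4lw.commands.whitelist:

    # run on the zookeeper host being checked (e.g. zk05)
    echo stat | nc localhost 2181 | grep Mode:
    # prints "Mode: leader" on the leader and "Mode: follower" on the others
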
corvuslgtm15:04
fungiyeah, once i'm out of meetings i can take a closer look15:07
clarkbthanks!15:07
clarkbseems like it is fairly quiet and I think I should be able to work through the process today if others aren't worried about potential conflicts with other tasks15:08
opendevreviewClark Boylan proposed opendev/system-config master: Add a summitspeakers@lists.openinfra.org mailing list  https://review.opendev.org/c/opendev/system-config/+/95378315:10
fungioh that's a neat idea15:10
mnasiadkaclarkb: Any idea what to do with https://review.opendev.org/c/opendev/zuul-providers/+/953269? It seems it's randomly failing at least one/two/three jobs on the source-repositories element (fetching that number of repos must be fun for gitea)15:36
clarkbmnasiadka: the latest round of failures were actually due to failures to upload the resulting image to swift15:38
mnasiadkauh, now I see that15:38
mnasiadkaActually not15:38
mnasiadkasource-repositories is not failing the job properly, and then it goes to post to publish an image which is not there15:39
mnasiadkasee https://zuul.opendev.org/t/opendev/build/9a3728c7cdc242c9ba6158ada739445f15:39
clarkbhttps://zuul.opendev.org/t/opendev/build/03c946441bee4ba79614e3deaa85e39b/log/job-output.txt#9761-9884 this looks successful to me15:39
clarkbso anyway I think we have two failure modes going on here. One that is less in our control (but we can provide feedback to jamesdenton__ and dan_with about) and then the "we might be making too many requests to gitea all at once" problem which we have more control over15:40
jamesdenton__clarkb i think there was a maint in dfw prod, if you're having api issues there15:40
clarkbjamesdenton__: this is sjc3 I think15:40
clarkbjamesdenton__: https://zuul.opendev.org/t/opendev/build/04843a0faeee4888ac32d1acb5f361c9/log/job-output.txt#9342 this specifically15:41
clarkbone option for the git fetches would be to create a job semaphore for those builds to reduce the total concurrency between themselves15:41
corvusfetches are a small part of those builds15:42
clarkbmnasiadka: ^ I think that may be where I would start. In the old system we only ever ran a maximum of 3 builds at the same time. Now we're running like 15. Maybe we set a semaphore of 5?15:42
clarkbcorvus: true, the semaphore is a very heavyweight solution.15:42
jamesdenton__thanks clarkb - looking into it15:43
corvushow about changing the element to retry fetches?15:43
mnasiadkathat's another option I was thinking about15:43
corvuswhat are the errors?15:43
corvushere's one: 2025-06-30 12:40:16.384 | fatal: unable to access 'https://opendev.org/openstack/devstack-plugin-open-cas.git/': GnuTLS recv error (-110): The TLS connection was non-properly terminated.15:43
clarkbthat may be worth trying as well. One suspicion I have is that this could be related to the total number of connections more than the total load. So if we don't see retries make things better we may need a different approach but I would be open to trying that15:43
corvusare they all like that?15:43
clarkbcorvus: yes I think so15:44
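
The retry approach being floated here would live in the diskimage-builder source-repositories element (which is bash); the sketch below is illustrative rather than the element's actual code, and the function name and retry/backoff values are assumptions:

    fetch_with_retry() {
        # clone a repo, retrying transient TLS/network failures a few times
        local url=$1 dest=$2 attempt
        for attempt in 1 2 3; do
            if git clone --bare "$url" "$dest"; then
                return 0
            fi
            echo "clone of $url failed (attempt $attempt), retrying in 30s" >&2
            rm -rf "$dest"
            sleep 30
        done
        # make sure a persistent failure actually fails the build instead of
        # silently proceeding to the post/publish phase with no image
        echo "giving up on $url" >&2
        return 1
    }
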
fungiclarkb: mnasiadka: skimming that swift error briefly, why are we uploading a vhd there?15:44
fungii thought we only used those for rax classic due to xen15:45
fungioh, wait, this is temporary storage, ignore me!15:45
clarkbfungi: yes, this is the intermediate step where we build the image then create a downloadable location for it in swift so that the followup publishing process can fetch and push to clouds as appropriate15:45
clarkbyup exactly15:45
corvushttps://grafana.opendev.org/d/1f6dfd6769/opendev-load-balancer?orgId=1&from=now-6h&to=now&timezone=utc15:45
corvusthere is a ramp up in connections from 12:30-130015:46
fungibut yeah, looks like that error could be some sort of middlebox terminating the upload prematurely, the task starts at 13:31:36 and gets aborted at 13:48:53 (~17.5 minutes into the upload)15:47
corvuslooking at all the failures in https://zuul.opendev.org/t/opendev/buildset/403de0442fea49d78d5e5a1ded814ad3 it looks like 2 of them are git failures, and 5 of them are swift upload failures?15:48
clarkbcorvus: mnasiadka I wonder if there is a way to have git keep a connection alive across git repos. Probably not but if there is something like ssh control persistence that may help here15:48
clarkbcorvus: yes I think the latest buildset is mostly swift failures. But previous buildsets have largely been gitea failures (but fewer of them overall)15:48
clarkbcorvus: mnasiadka it is also possible that we're not the actual source of the problem we're just more likely to hit issues in a noticeable way if there is external pressure15:51
clarkbwith the old system the build would fail then go back into the queue and eventually get rebuilt. Now we get a failure and don't retry the entire set until the next trigger?15:52
clarkbin that case retries would probably be sufficient15:52
opendevreviewMichal Nasiadka proposed opendev/zuul-providers master: Add image-build-jobs semaphore to limit concurrent jobs to 5  https://review.opendev.org/c/opendev/zuul-providers/+/95379715:54
mnasiadkaDid this, but will have a look at the retries tomorrow15:54
mnasiadkaand check why source-repositories is not failing the job properly15:55
corvusi mean, i left a -1 on that, but i feel like it's much more of a -215:55
corvusi'm like -1.9 on that15:55
mnasiadkasure, thought it's a fast option - but I guess it's better to fix the DIB element15:55
corvusyeah, it's an easy option with a big negative impact on image build jobs15:56
clarkbok https://review.opendev.org/c/opendev/system-config/+/951164 has a +1 again if we want to proceed with replacing the first zookeeper server cc fungi15:56
corvushttps://zuul.opendev.org/t/opendev/buildsets?pipeline=periodic-image-build are generally doing pretty well.  they take about 2 hours to run.  with a 5-job semaphore we would double that to 4 hours.15:58
corvusfixing the element should reduce the few failures we have there too15:58
fungiclarkb: was the latest revision of 951164 just a rebase to resolve a merge conflict?16:00
clarkbfungi: yes, ze12 was updated on the lines just above zk01 in the inventory file and that created a conflict16:01
clarkbso re-review is more about "are we cool with trying to do this today" than about the actual code change itself I think16:01
fungiare you ready for me to approve it now, or want to schedule that?16:01
clarkbfungi: I think we either do it now or wait for wednesday (basically I want that change to go in early relative to my day start so that I can try and get through the whole cluster over a single work day)16:02
clarkbso yes now is fine, or if you'd prefer we schedule it we can look at wednesday?16:02
fungijust making sure you're already on hand and didn't want me to wait a few minutes16:03
clarkbyup today works for me if we start now16:03
fungiit's in the queue now16:03
clarkbthanks!16:03
clarkbonce we get a little closer to that change merging I'll put zk06 in the emergency file and stop zookeeper there16:05
opendevreviewMerged openstack/project-config master: Reapply "Switch all Zuul tenants to use niz nodesets"  https://review.opendev.org/c/openstack/project-config/+/95376916:08
opendevreviewJames E. Blair proposed openstack/project-config master: Add zuul-launcher max servers  https://review.opendev.org/c/openstack/project-config/+/95380316:12
corvusi see some unexpected errors in the launcher:16:26
corvus2025-06-30 16:24:47,129 ERROR zuul.Launcher: [e: e8290ecbf5e947eba95425f2affb7d21] [req: 120137a5e4f544e6bcf89779b4c23177] Error in creating the server. Compute service reports fault: Build of instance f408fcde-0e98-4c99-95ef-c7a26bb93752 aborted: Image a6cca26c-0dae-46c0-91c8-35ce5b48cec6 is unacceptable: Image not in a supported format16:27
corvus2025-06-30 16:24:47,130 ERROR zuul.Launcher: [e: e8290ecbf5e947eba95425f2affb7d21] [req: 120137a5e4f544e6bcf89779b4c23177] Marking node <OpenstackProviderNode uuid=233c355a9ca14241b2841a668d6ce14a, label=niz-ubuntu-noble-8GB, state=building, provider=opendev.org%2Fopendev%2Fzuul-providers/raxflex-dfw3-main> as failed16:27
corvusthat's an image upload that finished.... right around that time16:27
corvus2025-06-30 16:24:00,950 DEBUG zuul.Launcher: Released upload lock for <ImageUpload 65c1f4978e5f41f9a7cb91592d61606f state: ready endpoint: raxflex/raxflex-DFW3 artifact: 3a52e8133fe542f4a35076804b6a3926 validated: True external_id: a6cca26c-0dae-46c0-91c8-35ce5b48cec6>16:28
corvusi wonder if we're not waiting for a post-processing step16:28
clarkbimage not in a supported format would imply something along those lines. Do you know what format we're uploading to that provider?16:31
clarkbif its qcow2 I wonder if the backend is converting to raw16:31
corvusqcow216:33
clarkba lot of clouds convert to raw on the backend and when we've had problems with them doing that we've just uploaded raw images instead16:35
clarkbI know with ceph this is advisable because you get better copy on write behavior from ceph (ironically)16:35
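
If uploading raw instead of qcow2 turns out to be the fix, the conversion and upload might look roughly like this; image names and paths are illustrative:

    # convert the built image to raw so the cloud doesn't convert it server-side
    qemu-img convert -f qcow2 -O raw ubuntu-noble.qcow2 ubuntu-noble.raw
    # upload the raw artifact
    openstack image create --disk-format raw --container-format bare \
        --file ubuntu-noble.raw ubuntu-noble
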
corvushttps://paste.opendev.org/show/b4UzBVq99xFIloG7gt8T/16:35
corvusthat's what the 'image show' says16:36
clarkbya I think the backend conversions are hidden from us. Glance just repeats back what we upload? I think we can confirm something changes by comparing the checksum values though16:37
corvusdo you know what the "checksum" field means?16:37
clarkbiirc the checksum in image show is glance calculated and if it changes the details on the backend that checksum won't match what we uploaded (this is a major flaw in glance that I've brought up before that I gave up on caring about when no one seemed to understand why it was problematic that i couldn't confirm the file I uploaded matched what glance thinks it has)16:38
clarkbcorvus: I believe it is an md5 checksum of the image16:38
corvusinteresting!  they are supposed to match in rax-flex, but they do not for this image16:40
corvusiow, all the other working images have matching checksums, but this one does not16:40
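
The cross-check being described is roughly the following; the local filename is illustrative and assumes the original upload artifact is still on disk:

    # glance's recorded md5 of the stored image data
    openstack image show -f value -c checksum a6cca26c-0dae-46c0-91c8-35ce5b48cec6
    # md5 of what was actually uploaded; these should match
    md5sum ubuntu-noble.qcow2
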
clarkbI wonder if this is just a "normal" bitflip type of situation then16:40
clarkbor maybe a short write16:40
clarkband the "not in a supported format" error message is how nova / glance's internal checking surfaces that failure16:41
corvuswe didn't log any errors during the upload16:42
corvusunfortunately, i have left https://review.opendev.org/931824 unaddressed, so i don't think we can localize the error further.  i will implement that as penance.16:43
corvusmeanwhile, i will delete that upload and let it try again.  we may get more data that way16:43
corvusactually... 1 sec.  we can see if we have the same error in sjc316:44
corvusbecause they should have come from the same file16:44
clarkbcan also cross check the md5 checksums16:46
clarkbif they don't match then it would be region specific upload issue maybe16:46
corvus++16:46
clarkbI have put zk06.opendev.org in the emergency file. We're a few minutes away from merging the change to replace it. I'll shut down zookeeper there when we're a bit closer16:47
fungithanks!16:49
corvusonly the sjc3 upload is affected.  unfortunately, two different launchers uploaded those.  that isn't normally supposed to happen (we're supposed to try to download the image only once), but it can happen as a fallback.16:50
corvusgrr i mean, only the dfw3 upload is affected.  sjc3 is fine.16:50
corvusoh i think i see it16:54
corvusremember those api errors mentioned earlier?16:54
clarkbya they affected uploads into the cloud regions or maybe we got short writes?16:55
corvusyeah.  still digging a bit, but we definitely hit one during the upload.  we retried the upload, but i'm guessing something persisted, since i think we used the same image name.16:55
corvus(that's why we switched launchers -- the second upload happened from the second launcher)16:56
*** iurygregory__ is now known as iurygregory16:58
corvusi think we need a "delete and replace upload" function, and to call it in cases like this.  we probably shouldn't try to repeat a failed upload under the same name without deleting.16:58
clarkbmakes sense particularly for multi part uploads16:59
clarkbit might consider some parts of the upload to be successful even when it shouldn't16:59
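
A launcher-side "delete and replace upload" would be internal to zuul-launcher, but the manual equivalent with the CLI is roughly this (the image name and file are illustrative):

    # remove the suspect upload entirely before re-uploading under the same name
    openstack image delete a6cca26c-0dae-46c0-91c8-35ce5b48cec6
    openstack image create --disk-format qcow2 --container-format bare \
        --file ubuntu-noble.qcow2 ubuntu-noble
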
clarkbok testinfra tests are running now. The change should merge shortly. I'm going to stop zookeeper on zk06 now17:02
clarkbconnections appear to have rebalanced according to stat to both 04 and 05 as expected and 05 remains the leader17:03
opendevreviewMerged opendev/system-config master: Replace zk06 with zk01  https://review.opendev.org/c/opendev/system-config/+/95116417:07
clarkbok the zookeeper deployment job is finally running. This should bring up zk01 in a half working state. I need to restart zk04 and zk05 to pick up the new server listings (in that order so the leader goes last) after the deployment is done17:37
clarkbnodepool zk configs also updated at least on nl05 but services didn't restart there. Once I've got the cluster running in a three node state happily again I plan to restart nodepool services and maybe a zuul merger or two just to sanity check the connectivity to the new shape of the cluster17:38
opendevreviewJeremy Stanley proposed openstack/project-config master: Replace StarlingX's CLA enforcement with the DCO  https://review.opendev.org/c/openstack/project-config/+/95381917:39
fungiildikov: i'll plan on merging that ^ tomorrow17:39
clarkb`echo mntr | nc localhost 2181 | grep followers` shows 2 synced followers with one of them being a non voting follower17:40
clarkbwhich is what I think we expect since zk01 hasn't been inducted into the cluster yet (need to restart the leader for that)17:40
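
For reference, the follower counts come from mntr on the current leader; a minimal sketch, assuming these key names (they can vary slightly across ZooKeeper versions):

    # run on the leader; non-voting followers are members not yet inducted
    echo mntr | nc localhost 2181 | \
        grep -E 'zk_server_state|zk_synced_followers|zk_synced_non_voting_followers'
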
clarkbok cool zookeeper job completed successfully. I'm going to double check all three nodes have the right server list and that zk05 is still the leader then I'll restart zk04, check the state of things again then restart zk0517:41
clarkbyup this looks good I'm going to restart zk04 now17:42
clarkbI think that went well. zk05 is still the leader and still reports the same follower counts17:44
clarkbok I'm proceeding with a restart of the zk05 leader next17:44
clarkbzk04 is the new leader17:45
clarkbthere are two synced followers and no nonvoting followers. I think everything has caught up now17:46
clarkbI need to edit the etherpad because it assumed zk05 would stay the leader but zk04 is leader so things change. Then I need to push up a change to remove zk05 and replace it with zk02 to get that testing then I can restart nodepool and a zuul merger or three17:47
opendevreviewClark Boylan proposed opendev/system-config master: Replace zk05 with zk02  https://review.opendev.org/c/opendev/system-config/+/95382117:51
clarkbinfra-root ^ quick review on that is appreciated17:52
clarkbI'm going to restart nodepool services now17:52
clarkbmost new connections are going to zk05 but at least one made it to zk0117:58
clarkbcorvus: do you think it makes sense to restart zuul mergers too to try and get connections onto zk01 or just trust that since the connection count there is 2 that we're probably good enough?18:00
corvusit really does, by pure coincidence, look like about 50% of the openstack node traffic is going to niz, the other is still nodepool.  53% to be more exact.  that's based on the labels we just happened to have switched so far.18:01
clarkbnice18:01
corvusclarkb: maybe try a merger or two?18:01
clarkbon it18:01
corvusjust as a canary in case there's something we're not seeing18:02
corvusi'm going to start merging the label-switch changes to drive more niz traffic.18:02
opendevreviewMerged opendev/zuul-providers master: Switch centos-9-stream from nodepool to niz  https://review.opendev.org/c/opendev/zuul-providers/+/95271518:03
clarkbI did zm01 and zm02. zm01 moved from zk04 to zk05 and zm02 moved from zk04 to zk04 (it reconnected to the same one)18:04
clarkbI'll keep doing more to see if I can get at least one to go to zk0118:04
clarkbzm03 connected to zk0118:05
clarkbzm04 connected to zk05. I think that's probably good though. We have nb05 and zm03 talking to zk0118:07
*** iurygregory_ is now known as iurygregory18:07
corvus\o/18:08
opendevreviewJames E. Blair proposed openstack/project-config master: Grafana: update zuul status page for NIZ  https://review.opendev.org/c/openstack/project-config/+/95382318:09
clarkbI've approved https://review.opendev.org/c/opendev/system-config/+/953821 to keep things moving forward18:12
clarkbcorvus: I don't see zk01 in the grafana graphs for zuul. Do you know if I have to do anything special to get that data?18:13
clarkboh I see we filter on those node names in the graph definition18:13
clarkbI'll just do a followup to update the graphs and not worry about it at the moment18:13
clarkbunless you think that is important18:13
corvusfollowup sounds fine18:14
corvuswe're collecting the data18:14
corvus(it's in graphite)18:14
clarkbcool18:14
corvushttps://opendev.org/airship/promenade/src/branch/master/zuul.d/nodesets.yaml18:19
corvusi wonder what the story is there18:19
clarkbmight be like devstack where they are relying on the groups values to do special stuff in the jobs?18:20
corvusmaybe.  surprised to see a group with the same name as a node though18:20
opendevreviewMerged opendev/zuul-providers master: Switch debian-bookworm to niz  https://review.opendev.org/c/opendev/zuul-providers/+/95271618:21
opendevreviewMerged opendev/zuul-providers master: Switch debian-bullseye-arm64 nodesets to niz  https://review.opendev.org/c/opendev/zuul-providers/+/95271718:21
clarkbcorvus: after I've got zk02 enrolled in the cluster and I've checked things out the next step is to do the out of band update for zuul's zookeeper connection to tell it to connect to all three new nodes (while only two are up and running) then do a zuul restart. I may have questions / ask for help at that point depending on how things are going. Also I wonder if we should restart18:21
clarkbexecutors non gracefully but then do the schedulers one by one to avoid a web ui and event handling outage?18:21
clarkbthis way we only have to do the big zuul restart once18:22
corvusthat plan sgtm18:22
corvuslaunchers one at a time too18:22
clarkback18:22
corvusclarkb: do a hard stop of the executors all at once, that way will minimize jobs rolling over to executors that themselves are about to stop.18:24
clarkbcorvus: best way to do that is with ansible-playbook running zuul_stop.yaml with a limit to the executors?18:24
corvusthat should work.  i have developed a habit of just doing an ad-hoc ansible command that does 'docker-compose down'.  :)18:25
clarkbhappy to reuse what you know works too :)18:26
clarkbsomething like `ansible -f 20 zuul-executor -m shell -a 'cd /etc/zuul-executor && docker-compose down'` then run a similar docker-compose up -d after18:27
clarkbthat may be best actually because then I can do each class of zuul service one by one more easily and give them the different behavior they want18:28
corvusyep exactly18:28
corvusthat's mostly because i'm too lazy to go look at what the playbook does.  not that i necessarily think it's better.  :)18:29
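
Roughly what the ad-hoc stop/start described above could look like; the executor compose path comes from the log, while the scheduler compose path and doing schedulers host-by-host are assumptions:

    # hard-stop all executors at once, then bring them back up
    ansible -f 20 zuul-executor -m shell -a 'cd /etc/zuul-executor && docker-compose down'
    ansible -f 20 zuul-executor -m shell -a 'cd /etc/zuul-executor && docker-compose up -d'
    # schedulers one at a time to keep the web UI and event handling available
    ansible zuul01.opendev.org -m shell -a 'cd /etc/zuul-scheduler && docker-compose down && docker-compose up -d'
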
opendevreviewMerged opendev/system-config master: Replace zk05 with zk02  https://review.opendev.org/c/opendev/system-config/+/95382118:29
clarkbok I'm going to stop zookeeper on zk05 now that ^ has merged18:29
clarkbzk04 remains the leader18:30
opendevreviewMerged opendev/zuul-providers master: Switch ubuntu-jammy-arm64 labels to niz  https://review.opendev.org/c/opendev/zuul-providers/+/95271818:39
opendevreviewMerged opendev/zuul-providers master: Switch to using niz for ubuntu-noble-arm64  https://review.opendev.org/c/opendev/zuul-providers/+/95271918:39
opendevreviewMerged opendev/zuul-providers master: Switch to using niz labels for rockylinux-8  https://review.opendev.org/c/opendev/zuul-providers/+/95272018:39
opendevreviewMerged opendev/zuul-providers master: Switch to using niz for rockylinux-9  https://review.opendev.org/c/opendev/zuul-providers/+/95272118:39
opendevreviewClark Boylan proposed opendev/system-config master: Replace zk04 with zk03  https://review.opendev.org/c/opendev/system-config/+/95382418:45
clarkbI realized I could get things rolling with testing of ^ since I know which server is going to be replaced last now18:45
opendevreviewClark Boylan proposed openstack/project-config master: Limit grafana graphs for zookeeper to zk01-03  https://review.opendev.org/c/openstack/project-config/+/95382518:58
clarkbzookeeper job has finished. I'm going to do my checks across zk01, zk02, and zk04 now. Then will restart zk01 and zk04 once I'm happy19:01
clarkbyup everything looked as I expect it to so I've already restarted zk01 since it was follower. Going to restart zk04 now which is the current leader19:04
clarkbzk02 is the new leader and it has 2 synced followers and no nonvoting followers19:05
clarkbso our cluster is now zk01, zk02, and zk04 with zk04 as leader. Most services are only aware of zk04 as a valid connection point. I'm going to go restart nodepool services and zuul mergers like I did last time now which should have them connect to zk01 and zk02 and zk04 as valid options19:06
clarkbthen we can look into the bigger zuul restart after updating its config19:07
clarkboh sorry zk02 is the leader not zk0419:07
opendevreviewClark Boylan proposed opendev/system-config master: Replace zk04 with zk03  https://review.opendev.org/c/opendev/system-config/+/95382419:10
clarkbthere was a test failure in ^ due to my updating the zk04 node in tests to zk99. Should be fixed now19:10
clarkbnodepool services have restarted I believe nl05 is connected to zk0219:16
clarkbI'm going to run my zuul config update playbook against the zuul mergers and do them in bulk19:17
clarkbok I think that lgtm (zuul-mergers are done and are running with config for all three new servers now)19:22
clarkbcorvus: I think I want to do executors last so that test results for 953824 are not lost and have to restart19:23
corvusack19:23
clarkbcorvus: I'll do the launchers next. Can I hard stop/start them like mergers? looks like yes?19:24
clarkbhow much of a delay between each?19:24
corvusyep... it only takes them like 10 seconds to come up19:25
clarkbgot it. Working on that next19:25
opendevreviewMichal Nasiadka proposed openstack/diskimage-builder master: WIP: source-repositories: make git retry  https://review.opendev.org/c/openstack/diskimage-builder/+/95382919:27
clarkbzl01 is done. I'll pause for a moment then do zl0219:28
clarkband now zl02 is done19:29
clarkbarg I didn't properly fix the zookeeper tests so I need new test results anyway19:29
clarkbconsidering that I think I'll proceed with executors next now that launchers are done19:29
clarkbI'm going to docker compose down all of them then docker compose up -d all of them19:30
clarkbthat is done19:33
corvusi see nothing exploding on the launchers19:34
clarkbcorvus: you good with me proceeding to do scheduler and web on zuul01 then zuul02 once zuul01 is up again now?19:34
clarkbI've updated configs on zuul01 and zuul02. I'll proceed with restarting scheduler and web services on zuul01 now19:37
corvusyep19:38
clarkbok zuul01 has had its services restarted19:39
clarkbzuul01 web says it is stopped in the components list19:40
clarkbnow it says it is initializing. I just had to wait a bit I guess19:40
opendevreviewClark Boylan proposed opendev/system-config master: Replace zk04 with zk03  https://review.opendev.org/c/opendev/system-config/+/95382419:41
clarkbhopefully that fixes testing19:41
clarkbcorvus: /components says zuul01 scheduler is running now. That's super quick these days.19:44
corvus:)19:44
clarkband now zuul web is running. I'm going to do zuul02 nowish19:44
clarkbthe hourly run for zuul will reset the zuul.conf to contain zk01, zk02, and zk04 which is fine since zuul doesn't look at that config except at startup, so these restarts should have them all looking at zk01, zk02, and zk03. Then by the time the next zuul restart happens we should have all the zookeepers updated in our inventory19:46
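
A quick way to see which ZooKeeper servers a restarted service will use is to look at the deployed config, since zuul only reads it at startup; the path and TLS port below are assumptions about this deployment:

    sudo grep -A1 '^\[zookeeper\]' /etc/zuul/zuul.conf
    # expected something like:
    # [zookeeper]
    # hosts=zk01.opendev.org:2281,zk02.opendev.org:2281,zk03.opendev.org:2281
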
opendevreviewMichal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on timeout  https://review.opendev.org/c/openstack/diskimage-builder/+/72158119:47
opendevreviewMichal Nasiadka proposed openstack/diskimage-builder master: Retry git clone/fetch on timeout  https://review.opendev.org/c/openstack/diskimage-builder/+/72158119:48
clarkb#status log Restarted all of zuul to pick up new zookeeper server configuration19:52
opendevstatusclarkb: finished logging19:52
clarkbwe are now at merge https://review.opendev.org/c/opendev/system-config/+/953824 to remove zk04 and add zk03.19:52
clarkbI think it should pass testing now too :)19:52
opendevreviewMerged openstack/project-config master: Make scripts DCO compliant  https://review.opendev.org/c/openstack/project-config/+/95077019:55
corvusi'm not sure i've seen this keystone error before: https://paste.opendev.org/show/bZd9L8ZsYwelQ1xSz6eY/20:00
corvusi'm not worried about it atm; it looks like it was a one-off.20:00
opendevreviewJames E. Blair proposed opendev/zuul-providers master: Add new-style node labels  https://review.opendev.org/c/opendev/zuul-providers/+/95383220:01
opendevreviewJames E. Blair proposed opendev/zuul-providers master: Switch all nodepool labels to niz  https://review.opendev.org/c/opendev/zuul-providers/+/95383320:01
corvusi'm ready to do that ^ to drive up the niz traffic more.20:01
clarkbre the error I've not seen that one before20:04
clarkbcorvus: fungi https://review.opendev.org/c/opendev/system-config/+/953824 passed its zookeeper test job this time. Can I get a review and/or approval if it looks good?20:06
corvuslgtm, despite the trailing whitespace :)20:07
fricklerhmm, I just noticed that I lost all my bash history of zuul-client commands on zuul02. also "sudo zuul-client autohold-list" first downloaded the container and then fails because the config seems broken. is this expected and I need to switch to something new?20:08
clarkbfrickler: the server was replaced over the weekend20:09
clarkbI suspect that explains both behaviors. Do we need to issue a new auth jwt token on those servers?20:09
corvuszuul-client autohold-list --tenant openstack works for me?20:10
frickleroh, wait, I was missing the --tenant argument. adding that the command works fine indeed. so I just need to re-discover the proper commands I guess20:11
corvusyes, if that argument is omitted, it outputs this error: zuulclient.cmd.ArgumentException: Error: the --tenant argument or the 'tenant' field in the configuration file is required20:11
opendevreviewJames E. Blair proposed opendev/zuul-providers master: Switch all nodepool labels to niz  https://review.opendev.org/c/opendev/zuul-providers/+/95383320:19
corvusokay that's ready, now with fewer config errors :)20:20
clarkbcorvus: and once those go in we shutdown nodepool launchers?20:22
clarkbput another way what is the coordination point for that change20:22
clarkbmaybe zuul simply prefers niz over nodepool?20:22
corvusyeah, zuul prefers niz, no shutdown necessary20:25
corvuswe may still have a few niz requests (cf those last couple of images -- need to check where we are on that)20:26
corvusso next step would be to check the delta again, close the gap by adding any new labels necessary, then we can shutdown nodepool (if desired, but not required).20:27
corvusif anything goes wrong, we can just revert that change and it's instantly back to nodepool.20:27
opendevreviewMerged opendev/zuul-providers master: Add new-style node labels  https://review.opendev.org/c/opendev/zuul-providers/+/95383220:28
opendevreviewMerged opendev/zuul-providers master: Switch all nodepool labels to niz  https://review.opendev.org/c/opendev/zuul-providers/+/95383320:28
opendevreviewMichal Nasiadka proposed opendev/zuul-providers master: Add Ubuntu bionic/focal builds, labels and provider config  https://review.opendev.org/c/opendev/zuul-providers/+/95326920:28
corvusyeah those :)20:28
opendevreviewMichal Nasiadka proposed opendev/zuul-providers master: Add Ubuntu bionic/focal builds, labels and provider config  https://review.opendev.org/c/opendev/zuul-providers/+/95326920:31
opendevreviewJames E. Blair proposed opendev/zuul-providers master: Drop niz- label prefix from nodesets  https://review.opendev.org/c/opendev/zuul-providers/+/95383520:33
opendevreviewMichal Nasiadka proposed opendev/zuul-providers master: Add Ubuntu bionic/focal builds, labels and provider config  https://review.opendev.org/c/opendev/zuul-providers/+/95326920:33
opendevreviewMichal Nasiadka proposed opendev/zuul-providers master: Add Ubuntu bionic/focal builds, labels and provider config  https://review.opendev.org/c/opendev/zuul-providers/+/95326920:34
opendevreviewMichal Nasiadka proposed opendev/zuul-providers master: Add Ubuntu bionic/focal builds, labels and provider config  https://review.opendev.org/c/opendev/zuul-providers/+/95326920:36
opendevreviewJames E. Blair proposed opendev/zuul-providers master: Remove "normal" labels, etc  https://review.opendev.org/c/opendev/zuul-providers/+/95383620:44
clarkbI've done a first pass update at our meeting agenda. I dropped the centos 10 dib stuff as I think that is largely done now. I added some notes about zuul launcher status and zookeeper and zuul server replacements21:00
clarkbanything else to add?21:00
corvusthat probably covers all the changes i'd ask to make21:00
funginothing here21:03
opendevreviewJames E. Blair proposed opendev/zuul-providers master: Increase launch-timeout for rax-ord  https://review.opendev.org/c/opendev/zuul-providers/+/95384121:30
corvusi notice that rax-iad is turned off; what's the status with that?21:31
clarkbcorvus: I think that may have gotten lost when I ended up afk for a while21:32
clarkbcorvus: I suspect we can try turning it back on now and see if we get better behavior21:32
corvusack.  maybe leave it be for a bit, until we run out of other node-related things to change.  :)21:34
opendevreviewMerged opendev/system-config master: Replace zk04 with zk03  https://review.opendev.org/c/opendev/system-config/+/95382421:36
clarkbI will shutdown zookeeper on zk04 to get ahead of ^ momentarily21:37
clarkbthat's done. zk02 remains leader and has one follower currently21:38
clarkbafter zk03 deploys I'll double check zk01 and zk02 still look good then restart zk01 then zk02 (since 02 is leader it goes last). Then I can restart all of the nodepool services one last time. Then we should be fully on the new cluster and can start looking at followup changes21:40
clarkbzookeeper job has completed successfully. I'll check things and then do the necessary restarts22:08
clarkbrestarting zk01 made zk03 the leader22:10
clarkbbut the zxid seems appropriate and zuul is returning expected data so I think zookeeper handled that properly22:10
clarkbI'm just surprised for it to become the leader before I restarted zk0222:11
clarkbzk02 went into a state where it stopped serving requests after that zk01 restart resulted in zk03 being leader. I believe that is because zk02 was running with a config that didn't allow for zk03. Restarting zk02 to pick up the config change has it working again as a follower22:12
clarkbcorvus: ^ not sure if there is any other checking you want to do. I think this went well. I'm going to work on restarting nodepool services so they are aware of the new (and final) cluster state22:14
corvusclarkb: probably a good idea to do the graphs change now; would be good to double check those22:15
clarkbcorvus: https://review.opendev.org/c/openstack/project-config/+/953825 it is ready if you are22:16
corvusclarkb: i think there's one more; left a comment22:17
clarkbfwiw zk02 reported 2 synced followers one of which was nonvoting prior to the zk01 restart. I think when I restarted zk01 it decided that zk03 was the best leader because zk02 and zk03 were synced and zk03 had the higher id value22:17
clarkbso I think I've convinced myself that this behavior is expected and not abnormal22:17
clarkbbah how did I miss that one22:17
clarkbthen because zk02 didn't have config that knows about zk03 it basically went idle and stopped doing work22:18
clarkbit was a coup amongst the zks22:18
opendevreviewClark Boylan proposed openstack/project-config master: Limit grafana graphs for zookeeper to zk01-03  https://review.opendev.org/c/openstack/project-config/+/95382522:18
clarkbcorvus: ^ there22:19
clarkbnl07 is connected to zk02 (since zk03 became leader lots of stuff ended up over there and it is fine). I just need to finish restarting nl05 and nl06 then I think the application side of this move is done22:23
clarkbnl05 also connected to zk0222:24
clarkball services have been restarted on configs with the three new nodes. All three new nodes are in place and seem to indicate that they are talking to each other via stat and mntr 4 letter commands. Services are stopped on the old servers and they are removed from the inventory22:25
clarkbI think confirming that grafana looks sane is the next step then we can do things like cleanup dns. I'll work on a change for that next22:25
opendevreviewClark Boylan proposed opendev/zone-opendev.org master: Remove zk04-zk06 from DNS  https://review.opendev.org/c/opendev/zone-opendev.org/+/95384422:29
opendevreviewClark Boylan proposed opendev/system-config master: Cleanup zookeeper config management  https://review.opendev.org/c/opendev/system-config/+/95384622:32
opendevreviewClark Boylan proposed opendev/system-config master: Remove docker compose version from zuul services  https://review.opendev.org/c/opendev/system-config/+/95384822:35
clarkbok last call on the meeting agenda. I'm going to send that out in ~10 minutes22:36
clarkbcorvus: until we have live data we have this screenshot: https://58e37da755b04d7dc54b-c9565b17ee8809c34acc7a1ea3e19ee3.ssl.cf1.rackcdn.com/openstack/e7f70913b8b84f5394eaed9aa16995c6/screenshots/zuul-status.png22:39
corvuszk01 latency looks a little bit higher than our old values (0.4ms) but probably within the margin of error22:41
corvusi don't see anything else unusual22:41
opendevreviewMerged opendev/zuul-providers master: Increase launch-timeout for rax-ord  https://review.opendev.org/c/opendev/zuul-providers/+/95384122:43
corvusthere's a handful of low-priority niz changes at https://review.opendev.org/q/hashtag:opendev-niz+status:open22:44
clarkback I'll review those as soon as I hit send on this meeting agenda22:45
corvuslooks like we're at 87% niz and 13% nodepool22:48
corvusclarkb: since i had a shell open, i ran the cacti graph create script for zk01-zk0322:52
clarkbthanks!22:52
opendevreviewMerged openstack/project-config master: Limit grafana graphs for zookeeper to zk01-03  https://review.opendev.org/c/openstack/project-config/+/95382522:54
clarkbcorvus: all those niz related stats changes and nodeset/label updates lgtm22:54
corvus\o/22:54
clarkbhttps://grafana.opendev.org/d/21a6e53ea4/zuul-status?orgId=1&from=now-15m&to=now&timezone=utc seems to have updated now22:57
clarkboh I'm only looking at the last 15 minutes22:57
clarkbyou may wish to change the time range if you open that link22:57
corvuslgtm22:58
clarkbI'll probably leave the old servers up with their DNS records in place until at least tomorrow. I think this is a good pause point. Things seem to be working and we can see how zuul handles the periodic job enqueues later this evening before doing more destructive changes to the old servers22:59
corvusclarkb: mnasiadka why is https://review.opendev.org/953269 marked WIP?22:59
clarkbI don't know why22:59
corvusi don't see an explicit event for it23:01
corvushypothesis: it was set by one of the rebase or web-based editing actions, possibly accidentally or without mnasiadka 's knowledge23:02
corvusoh, well it has a depends on anyway23:03
clarkbcorvus: https://zuul.opendev.org/t/openstack/build/9707f99650374c28854400788ebfc0b1/log/job-output.txt#31-5323:09
clarkbcorvus: I suspect that build got a nodeset built from nodes from two different providers23:09
clarkbcorvus: note the Region values in particular (only one provider value is set)23:10
clarkbI noticed because the zookeeper test actually checks that external connectivity to the zookeeper non ssl port is blocked and I think it got confused running that test from bridge to the zk test server?23:10
clarkband that resulted in a socket timeout error rather than socket no route to host error23:11
clarkbI think this may impact multinode testing in general since we try to set up vxlan tunnels for that over private ips iirc. But I can't be sure of that without more digging23:11
corvusit does look like that was from two different providers.  the launcher should prefer to use a single provider23:12
corvusbut it is capable of not doing so in extenuating circumstances23:12
corvus(this is a change from nodepool)23:12
corvusi'll try to see what the circumstances were23:13
clarkbright nodepool would only ever provide nodes from the same provider or fail23:13
opendevreviewJeremy Stanley proposed openstack/project-config master: Replace Airship's CLA enforcement with the DCO  https://review.opendev.org/c/openstack/project-config/+/95384923:13
clarkbI see openstack grenade multinode jobs are succeeding in the gate so this isn't a hard fail for everything. That would support the extenuating circumstances theory23:14
corvusclarkb: the node was a leftover ready node from a previous canceled request (probably a previous patchset of the same change).  it looks like we could be better about the interaction of ready nodes and provider preference.23:28
clarkbcorvus: got it. The other thing I wonder about is why the provider info is lost there23:36
clarkbmaybe we don't carry that forward when reusing recycled nodes?23:36
corvusyeah, an unattached ready node loses its provider info, mostly so that it can float to other providers/tenants/etc that have the same config.  we should re-assign it when we select it.23:37
corvus(it's a weird artifact of having multiple providers with the same endpoint, but they could be different in different tenants)23:38
corvusclarkb: fyi https://review.opendev.org/953851 is the simplest way i can think to improve that.  there's a better way, but it'd make a very complicated method even more complicated, so i'd like to try simple first.  :)23:53
corvusclarkb: i do agree that could end up causing problems for devstack runs, but hopefully not too many before we get that fixed.  i think it's likely to be rare enough we can roll forward instead of back.23:56
